Re: [Beowulf] Your thoughts on the latest RHEL drama?
I sat down and carefully studied all the comments about RH's plans.

On Mon, Jun 26, 2023 at 02:27:23PM -0400, Prentice Bisbal via Beowulf wrote:
> Somewhere around event #3 is when I started viewing RHEL as the MS of the Linux world, for obvious reasons. It seems that RH is determined to make RHEL a monopoly of the "Enterprise Linux" market. Yes, I know there's Ubuntu and SLES, but Ubuntu is viewed as a desktop more than a server OS (IMO), and SLES hasn't really caught on, at least not in the US.

For a number of years the small x86-64 clusters I supported used OpenSuSE; on the next cluster I decided to switch to CentOS 7 simply because of its great popularity in HPC and the greater breadth of available packages. BTW, I also use CentOS 7 on my home desktop (with GNOME), and in general I want to have the same distribution both at home and on the cluster. So Ubuntu is a good starting point for me in the future :-) (Nvidia likes Ubuntu on their GPU servers, but that is all for AI). But I would also like to hear your point of view on SLES / OpenSuSE - after all, the Cray HPC OS is based on SuSE. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] milan and rhel7
In message from Michael DiDomenico (Tue, 28 Jun 2022 17:40:09 -0400):
> milan cpu's aren't officially supported on less than rhel8.3. but there's anecdotal evidence that rhel7 will run on milan cpu's. if the evidence is true, is anyone on the list doing so and can confirm?

Yes, RHEL requires upgrading to 8.3 or later to work with EPYC 7003: https://access.redhat.com/articles/5899941. Officially CentOS 7 doesn't support this hardware either. You can switch to OpenSuSE - Milan support is available in 15.3. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] likwid vs stream (after HPCG discussion)
In message from Scott Atchley (Sun, 20 Mar 2022 14:52:10 -0400):
> On Sat, Mar 19, 2022 at 6:29 AM Mikhail Kuzminsky wrote:
>> If so, it turns out that for the HPC user, STREAM gives a more important estimate - the application is translated by the compiler (they do not write in assembler, except for modules from mathematical libraries), and STREAM will give a real estimate of what will be obtained in the application.
> When vendors advertise STREAM results, they compile the application with non-temporal loads and stores. This means that all memory accesses bypass the processor's caches. If your application of interest does a random walk through memory and there is neither temporal nor spatial locality, then using non-temporal loads and stores makes sense and STREAM is irrelevant.

STREAM was never oriented toward random memory access. In that case memory latencies are what matter, and it makes more sense to get a bandwidth estimate from mega-stream (https://github.com/UK-MAC/mega-stream). ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] likwid vs stream (after HPCG discussion)
In the HPCG discussion it was proposed to use the now widely used likwid benchmark suite to estimate memory bandwidth. It gives excellent estimates of the hardware capabilities. Am I right that likwid uses its own optimized assembler code for each specific microarchitecture? If so, it turns out that for the HPC user STREAM gives the more important estimate - applications are translated by the compiler (people do not write in assembler, except for modules from mathematical libraries), so STREAM gives a realistic estimate of what will actually be obtained in an application (a sketch of the kernel in question follows this message). Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
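For reference, this is roughly what the STREAM "triad" kernel looks like as plain Fortran that the compiler has to vectorize on its own, in contrast to likwid-bench's hand-written assembly kernels. It is only a minimal sketch (the real benchmark's timing and verification code is omitted, and the array size here is just an example):

! Minimal sketch of the STREAM triad kernel in plain Fortran.
! The point is that the compiler, not hand-written assembler,
! decides how these loads and stores are vectorized and scheduled.
program triad_sketch
  implicit none
  integer, parameter :: n = 20000000        ! large enough to defeat caches
  double precision, allocatable :: a(:), b(:), c(:)
  double precision :: q
  integer :: i
  allocate(a(n), b(n), c(n))
  b = 1.0d0; c = 2.0d0; q = 3.0d0
  do i = 1, n
     a(i) = b(i) + q*c(i)                   ! triad: 2 flops, 3 memory streams
  end do
  print *, a(n)                             ! keep the loop from being optimized away
end program triad_sketch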
[Beowulf] About TofuD in A64FX and Infiniband HDR
My initial question is about interconnect bandwidth in a cluster: I want to understand the bandwidth available with TofuD in the A64FX versus InfiniBand HDR. With HDR everything is clear - 25 GB/s for a 4x link or 75 GB/s for a 12x link; but the PCIe v3 x16 interface in the A64FX will cap that bandwidth anyway. Things are trickier with TofuD. Each link has 2 lanes; 28.05 Gbps x 2 gives about 7 GB/s (really about 6.8 GB/s). 6 TNIs, i.e. 6 links, give a total of 40.8 GB/s of injection bandwidth - more than the 25 GB/s of 4x HDR. This is for 2.2 GHz; I understand that at 1.8 GHz all the numbers decrease accordingly. If I calculated correctly above (a back-of-envelope check follows this message), then: can I get close to 40.8 GB/s in a simple MPI put to another node, or will the limit be 6.8 GB/s? In the latter case HDR will give more bandwidth (on Ookami with InfiniBand: 19.4 GB/s maximum in the OSU MPI benchmarks). Does Ookami use InfiniBand rather than TofuD because of such bandwidth considerations for a not very large cluster, or for financial reasons (cost of the TofuD routers?)? Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
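The back-of-envelope arithmetic behind the numbers above, written out as a tiny Fortran check. This is only my own arithmetic from the figures quoted in the message, not official Fujitsu or NVIDIA/Mellanox data:

! Rough check of the per-link and injection bandwidth figures quoted above.
program tofud_estimate
  implicit none
  double precision :: lane_gbps, link_gbs, inject_gbs
  lane_gbps  = 28.05d0                    ! signalling rate per lane, Gbit/s
  link_gbs   = 2.0d0 * lane_gbps / 8.0d0  ! 2 lanes per link -> ~7.0 GB/s raw
  inject_gbs = 6.0d0 * 6.8d0              ! 6 TNIs x ~6.8 GB/s effective
  print '(a,f6.2,a)', 'raw TofuD link bandwidth:  ', link_gbs,   ' GB/s'
  print '(a,f6.2,a)', 'total injection bandwidth: ', inject_gbs, ' GB/s'
end program tofud_estimate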
Re: [Beowulf] AMD and AVX512
I apologize - I should have written earlier, but I cannot always work with my broken right hand. It seems to me that a reasonable basis for discussing AMD EPYC performance could be the performance data in the Daresbury benchmark set from M. Guest. Yes, newer versions of AMD EPYC and Xeon Scalable processors have appeared since then, and new compiler versions. However, Intel already had AVX-512 support then, and AMD only 256-bit AVX2. Of course, peak performance is not as important as application performance. There are applications where performance is not limited by vector work - there AVX-512 may not be needed. In AI tasks, on the other hand, vector work does matter - and GPUs are often used there. For AI the Daresbury benchmarks are accordingly less relevant. And in Zen 4, AMD seems set to support 512-bit vectors. But the performance of linear algebra does not always require a GPU. In quantum chemistry you can get an acceleration from vectors on a V100 of, say, a factor of 2 - and how much more expensive is the GPU? Of course, support for 512-bit vectors is a plus, but you really need to look at application performance and cost (including power consumption). I prefer to look at the A64FX now, although applications may need to be rebuilt. Servers with A64FX are sold now, but the price is very important.

In message from John Hearns (Sun, 20 Jun 2021 06:38:06 +0100):
> Regarding benchmarking real world codes on AMD, every year Martyn Guest presents a comprehensive set of benchmark studies to the UK Computing Insights Conference. I suggest a Sunday afternoon with the beverage of your choice is a good time to settle down and take time to read these or watch the presentation.
> 2019 https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf
> 2020 Video session https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 Skylake / Cascade Lake / AMD Rome
> The slides for 2020 do exist - as I remember all the slides from all talks are grouped together, but I cannot find them. Watch the video - it is an excellent presentation.

On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>The answer given, and I'm not making this up, is that AMD listens to their users and gives the users what they want, and right now they're not hearing any demand for AVX512.
>
>Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only 1/2 the floating point performance of Intel processors".
I suspect that is marketing speak, which roughly translates to: it is not that no one has asked for it, but rather that requests haven't reached a threshold where they are viewed as significant enough.
> Sure, AMD can offer more cores, but with only AVX2 you'd need twice as many cores as Intel processors, all other things being equal.
But of course all other things aren't equal. AVX512 is a mess. Look at the Wikipedia page(*) and note that AVX512 means different things depending on the processor implementing it. So what does the poor software developer target? Or, for heat reasons, it can cause CPU frequency reductions, meaning real world performance may not match theoretical - thus it is easier to just go with GPUs. The result is that most of the world is quite happily (at least for now) ignoring AVX512 and going with GPUs as necessary - particularly given the convenient libraries that Nvidia offers.
> I compared a server with dual AMD EPYC 7H12 processors (128 cores) to one with quad Intel Xeon 8268 processors (96 cores).
> From what I've heard, the AMD processors run much hotter than the Intel processors, too, so I imagine a FLOPS/Watt comparison would be even less favorable to AMD.
Spec sheets would indicate AMD runs hotter, but then again you benchmarked twice as many Intel processors. So, per the spec sheets for your processors above:
AMD - 280W - 2 processors means 560W per system
Intel - 205W - 4 processors means 820W per system
(and then you also need to factor in purchase price).
> An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory limited, and moving data in and out of GPU memory can still take time, so I can see situations where for large amounts of data using CPUs would be preferred over GPUs.
AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3, which may or may not make a difference. But despite all of the above and the other replies, it is AMD who has been winning the HPC contracts of late, not Intel.
* - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] GPUs Nvidia C2050 w/OpenMP 4.5 in cluster
The heterogeneous nodes in my small CentOS 7 cluster have x86-64 CPUs along with old Nvidia C2050 (Fermi) GPUs. A new Fortran program uses MPI + OpenMP. Do the modern gfortran or Intel ifort compilers support offloading work to these GPUs through OpenMP 4.5? (A sketch of the kind of construct I mean follows this message.) Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
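To make the question concrete, below is a minimal sketch of an OpenMP 4.5 target-offload loop in Fortran. This only illustrates the directive syntax being asked about; whether any current offload toolchain still accepts a Fermi-generation C2050 as a target is exactly the open question, and the subroutine and variable names are just examples:

! Sketch of an OpenMP 4.5 "target" region in Fortran (syntax illustration only;
! Fermi-class GPUs may well be too old for current offloading toolchains).
subroutine saxpy_offload(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do
end subroutine saxpy_offload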
Re: [Beowulf] Fortran is Awesome
I believe that whether it is rational to use Fortran still depends very much on the application. In quantum chemistry, where I used to program, as in computational chemistry in general, Fortran remains the main language.

> Yes, C is dangerous. You can break your code in ever so many ways if you code with less than discipline and knowledge and great care.

This may mean that in some cases writing a Fortran program can be easier, and therefore faster, than writing it in C.

> Hell, at my age I may never write serious C applications ever again, but if I write ANYTHING that requires a compiler, its going to be in C.

I haven't programmed in quantum chemistry for a very long time. But recently I wrote a tiny program for a computational chemistry task - and I did it in Fortran :-) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Oh.. IBM eats Red Hat
There are probably several reasons for this acquisition - on both the IBM and the Red Hat side. And it is very difficult to discuss it now, when it is not clear how events will develop in the future. But it is much better for Red Hat to join IBM than if Microsoft had gotten involved :-)).

> I don't know what to make of systemd as a design decision. I'm an Old Guy, so by definition I grew up with init and the classic Unix OS structure -- I still have all of the books in my office, sadly at least semi-obsolete within the current kernels and linux layout.

I worked for a number of years on IBM mainframes with the MVS OS. I hope that IBM is not a bad choice for Red Hat. One can also mention xCAT, developed by IBM. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] About Torque maillist
Does anyone know whether the Torque mailing list (earlier torqueus...@clusterresources.com) still works, now that Adaptive Computing has switched to commercial-only software? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] batch systems connection
Sorry, maybe my question is not exactly on topic for the Beowulf list. I have 2 small OpenSuSE-based clusters using different batch systems, and I want to connect them "grid-like" via the CREAM (Computing Resource Execution And Management) service (I may also add one common server for both clusters). But there are no CREAM binary RPMs for OpenSuSE (only for CentOS7/SL6 on the UMD site //repository.egi.eu/2018/03/14/release-umd-4-6-1/). I could not find where I can download the source code of the CREAM software - does anybody know? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] cursed (and perhaps blessed) Intel microcode
In message from Mark Hahn <h...@mcmaster.ca> (Fri, 23 Mar 2018 16:02:12 -0400 (EDT)):
> There *is* an updated microcode data file: https://downloadcenter.intel.com/product/873/Processors which seems to correspond to the document above

As I understand it, this corrects a general defect present in practically all CPUs. But the microcode update may also decrease performance. Maybe Intel intends to improve this update and therefore does not recommend using this version? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"
> Rumours flying that the Xeon Phi family is in jeopardy, but the article has an addendum to say:
> # [Update: Intel denies they are dropping the Xeon Phi line, saying only that it has "been revised based on recent customer and overall market needs."]
> This should cause some confusion. While Knights Hill was cancelled, Intel has quietly put information about Knights Mill online as the next Phi product line: https://www.anandtech.com/show/12172/intel-lists-knights-mill-xeon-phi-on-ark-up-to-72-cores-at-320w-with-qfma-and-vnni

I partially disagree about the "confusion". It's simple: KNM has minimal microarchitecture changes vs. KNL and does not focus on normal double precision. KNM focuses on single and reduced precision and is oriented toward deep learning, AI, etc. Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"
> Unfortunately I did not find the english version, but Andreas
> Essentially yes, Xeon Phi is not continued, but a new design called Xeon-H is coming.

Yes, and Xeon-H has a codename close to KNL's - Knights Cove. Maybe some microarchitecture features important for HPC will remain. But in any case the end of Xeon Phi is a plus for the new NEC SX-Aurora. Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] slurm in heterogenous cluster
In message from Christopher Samuel (Mon, 18 Sep 2017 16:03:47 +1000):
> ... The best info is in the "Upgrading" section of the Slurm quickstart guide: https://slurm.schedmd.com/quickstart_admin.html ... So basically you could have (please double check this!): slurmdbd: 17.02.x slurmctld: 17.02.x slurmd: 17.02.x & 16.05.x & 15.08.x ...

Thank you very much! I hope that modern major Slurm versions will also compile and build successfully on old Linux distributions (for example, with a 2.6 kernel). Yours, Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] slurm in heterogenous cluster
Is it possible to use different Slurm versions on different worker nodes of a cluster (with other slurmctld and slurmdbd versions on the head node)? If it is possible in principle to use different slurmd versions on different worker nodes, what are the most important restrictions? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Register article on Epyc
Returning to the first message:
> Also regarding compute power, it would be interesting to see a comparison of a single socket of these versus Xeon Phi rather than -v4 or -v5 Xeon.

I partially disagree with the general direction of the discussion. AMD Epyc looks like an excellent CPU for datacenters. But if we are talking about Beowulf and HPC, we must start first of all not from SPECfp_rate, but simply from FLOPS per cycle per core, or from something like Linpack, dgemm or similar tests. OK, it is known that the Zen core supports AVX2 only via a 128-bit base and gives only 8 DP FLOPS per cycle (see http://www.linleygroup.com/mpr/article.php?id=11666 or https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.930-X86-core-MikeClark-AMD-final_v2-28.pdf). A Broadwell core gives 16 FLOPS/cycle, and Skylake-SP 32 FLOPS/cycle with AVX-512. Therefore SPECfp_rate2006 may look good for the Epyc 7601 because of its 32 cores per CPU instead of 22 cores for the Broadwell Xeon E5-2699A v4. Xeon Phi KNL cores also give 32 DP FLOPS per cycle. In my opinion, it is necessary to wait for results of normal HPC tests. (A rough peak-GFLOPS comparison follows this message.) Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
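For what it is worth, here is the rough peak-DP arithmetic implied by the FLOPS/cycle numbers above, using nominal base clocks and ignoring AVX frequency reduction - ballpark figures for illustration only, not measurements:

! Peak DP GFLOPS ~ cores x base GHz x FLOPS/cycle (ballpark only).
program peak_flops
  implicit none
  print '(a,f7.1,a)', 'EPYC 7601        (32 cores x 2.2 GHz x  8): ', 32*2.2d0*8,  ' GFLOPS'
  print '(a,f7.1,a)', 'Xeon E5-2699A v4 (22 cores x 2.4 GHz x 16): ', 22*2.4d0*16, ' GFLOPS'
end program peak_flops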
Re: [Beowulf] more automatic building
Many thanks for all the answers! It looks to me now that OpenHPC may be the best choice for me. One of my 2 existing clusters is based on RH, the 2nd on OpenSuSE. That OpenHPC is based on repositories is a plus for me, as is the support of MVAPICH2/OpenMPI/Intel MPI (I don't know about plain MPICH). Etc. Are there, in your opinion, any clear OpenHPC minuses? Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] more automatic building
I have always worked with very small HPC clusters and built them manually (each server). But what is reasonable to do for clusters containing some tens or hundreds of nodes? With modern Xeon (or Xeon Phi KNL) and IB EDR, during the next year for example. There are automatic provisioning systems like OSCAR or even ROCKS. But it looks like ROCKS doesn't support modern interconnects, and there may be problems with OSCAR versions supporting systemd-based distributions like CentOS 7. For next year - is it reasonable to wait for a new OSCAR version, or for something else? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Thoughts on IB EDR and Intel OmniPath
From Joe Landman <land...@scalableinformatics.com>:
> Even with RoCE2, some of the testing we did demonstrated very significant congestion related slowdowns that we couldn't easily tune for (with PFC and other bits that RoCE needs).
> I've used iWARP in the dim and distant past, and it was much better than plain old gigabit on the same systems (with Ammasso cards).

BTW, this raises the question of the choice between RoCE and iWARP. Does your "Even with RoCE2" mean that iWARP is worse than RoCE? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] modern batch management systems
In our older clusters we used PBS and SGE as batch systems to run quantum-chemical applications. Now there are also commercial versions - PBS Pro and Oracle Grid Engine - and other commercial batch management programs. But we rely on free open-source batch management systems. Which free (and likely to remain free in a few years) batch systems would you recommend? Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Haswell as supercomputer microprocessors
In my opinion, PowerPC A2 would more exactly be used as the name of the *core*, not of the IBM Blue Gene/Q *processor chip*. The Power BQC name is used in the TOP500, the GREEN500, in a lot of Internet data, and in the IBM journal - see: Sugavanam K. et al. Design for low power and power management in IBM Blue Gene/Q // IBM Journal of Research and Development. - 2013. - v. 57. - no. 1/2. - p. 3:1-3:11. PowerPC A2 is the core, see //en.wikipedia.org/wiki/Blue_Gene and //en.wikipedia.org/wiki/PowerPC_A2 Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Haswell as supercomputer microprocessors
New specialized supercomputer microprocessors (like the IBM Power BQC and the Fujitsu SPARC64 XIfx) have 2**N + 2 cores (N=4 for the first, N=5 for the second), where the 2 extra cores are redundant - not for computation, but only for other work with Linux, or even for replacing a failed computational core. Current Intel Haswell E5 v3 parts may also have 18 = 2**4 + 2 cores. Does it make sense to try the Power BQC or SPARC64 XIfx idea (not exactly) and use only 16 Haswell cores for parallel computation? If the answer is yes, how can this be done under Linux? (A hedged sketch follows this message.) Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
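One way this is commonly approximated on a stock Linux kernel - a hedged sketch, not the BlueGene/SPARC64 mechanism itself; core numbers, application name and exact flags are only examples:

# Reserve 2 of the 18 cores for the OS and bind the job to the other 16.
# Isolate cores 0-15 from the general scheduler via the kernel command line:
#     isolcpus=0-15
# The OS and daemons then stay on cores 16-17, and the parallel job is
# bound explicitly to the isolated cores:
taskset -c 0-15 ./my_app
# or, NUMA-aware:
numactl --physcpubind=0-15 --localalloc ./my_app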
[Beowulf] sorry
I apologize again for the erroneous setting of the date field in a mailer I used some years ago. Mikhail Kuzminsky ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Supermicro BIOS error (was Nvidia K20 + Supermicro mobo)
Previously I described here the situation with a K20c GPU on a Supermicro X9SCA-F mobo, where the NVIDIA driver v.319.32 (the latest version) could not be installed. NVIDIA wrote to me that it is a Supermicro board (BIOS) error: the BIOS does not allocate memory (via the BAR registers) for the device. We found that this erroneous situation is absent on a Supermicro X8-series board and on an ASUS board - the driver was installed successfully on OpenSUSE 12.3 (and 11.4 also); the nvidia-smi utility works normally. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] PCI configuration space errors ? (was Nvidia K20 + Supermicro mobo)
Let me try to set the GPUs aside for a moment. I don't know who sets up the BARs for PCI-E devices: the BIOS or the Linux kernel (OpenSUSE 12.3, kernel 3.7.10-1.1 in my case). Below is the relevant part of /var/log/messages; at the corresponding moment of kernel boot no Nvidia GPU driver is loaded for PCI 01:00.0.
---from /var/log/messages--
2013-07-21T02:28:58.348552+04:00 c6ws4 kernel: [0.432261] ACPI: ACPI bus type pnp unregistered
2013-07-21T02:28:58.348554+04:00 c6ws4 kernel: [0.438011] pci :00:01.0: BAR 15: can't assign mem pref (size 0x1800)
2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [0.438015] pci :00:01.0: BAR 14: assigned [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348555+04:00 c6ws4 kernel: [0.438018] pci :01:00.0: BAR 1: can't assign mem pref (size 0x1000)
2013-07-21T02:28:58.348556+04:00 c6ws4 kernel: [0.438020] pci :01:00.0: BAR 3: can't assign mem pref (size 0x200)
2013-07-21T02:28:58.348557+04:00 c6ws4 kernel: [0.438023] pci :01:00.0: BAR 0: assigned [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [0.438026] pci :01:00.0: BAR 6: can't assign mem pref (size 0x8)
2013-07-21T02:28:58.348558+04:00 c6ws4 kernel: [0.438028] pci :00:01.0: PCI bridge to [bus 01]
2013-07-21T02:28:58.348559+04:00 c6ws4 kernel: [0.438031] pci :00:01.0: bridge window [mem 0xd100-0xd1ff]
2013-07-21T02:28:58.348561+04:00 c6ws4 kernel: [0.438035] pci :00:1c.0: PCI bridge to [bus 02]
-
Of course, there are far more than 2 PCI devices in the system (based on a Supermicro X9SCA-F, latest BIOS v.2.0b), but such BAR error messages appear only for 2 of them: the PCI bridge (00:01.0, the Xeon E3-1230 PCI-E port) and the Nvidia/PNY K20c at 01:00.0. Does this mean some BIOS problem, or is it a result of the absence of the loaded nvidia driver? The BAR error messages above appear independently of the BIOS/PCI settings: a) 4G decoding enabled/disabled, b) whether PCI-E Gen.2 mode is forced (instead of Gen.3) or not. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
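One thing that is sometimes worth trying when firmware leaves prefetchable BARs unassigned - offered only as an assumption, not a verified fix for this particular board - is letting the kernel reallocate PCI bridge resources itself and then rechecking the GPU's regions:

# Add to the kernel command line in the boot loader, then reboot:
#     pci=realloc=on
# Afterwards check whether the GPU's prefetchable BARs received addresses:
lspci -vv -s 01:00.0 | grep -i 'memory at'
dmesg | grep -i BAR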
Re: [Beowulf] Nvidia K20 + Supermicro mobo
Adam DeConinck ajde...@ajdecon.org wrote:
> I've seen similar messages on CentOS when the Nouveau drivers are loaded and a Tesla K20 is installed. You should make sure that nouveau is blacklisted so the kernel won't load it. Note that it hasn't always been enough for me to have nouveau listed in /etc/modprobe.d/blacklist; sometimes I've had to actually put rdblacklist=nouveau on the kernel line.

Loading of the nouveau driver is suppressed via /etc/modprobe.d. lsmod doesn't show the presence of the nouveau module; therefore I hope that rdblacklist as a kernel parameter is not necessary. The first group of kernel messages about BARs appears BEFORE I start the nvidia driver installation, so I think my corresponding question does not depend on the driver installation and, in particular, on nouveau. Mikhail

> Disclaimer: I work at NVIDIA, but I haven't touched OpenSUSE in forever. Cheers, Adam
> On Tue, Jul 16, 2013 at 10:29 AM, Mikhail Kuzminsky mikk...@mail.ru wrote:

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
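For completeness, the usual shape of the nouveau blacklisting that Adam describes - a hedged sketch only (the file name is just a convention, and rdblacklist=nouveau applies to CentOS/dracut-style initrds, as he notes):

# /etc/modprobe.d/50-blacklist-nouveau.conf   (example file name)
blacklist nouveau
options nouveau modeset=0
# If the initrd still loads nouveau early, the kernel command line may also
# need rdblacklist=nouveau (CentOS/dracut), followed by an initrd rebuild.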
[Beowulf] Nvidia K20 + Supermicro mobo
I want to test an NVIDIA GPU (PNY Tesla K20c) with our own application, for future use in our cluster. But I found problems with the NVIDIA driver (v.319.32) installation (OpenSUSE 12.3, kernel 3.7.10-1.1). First of all, even before starting the driver installation I see messages about BAR registers that are strange to me:
---from /var/log/messages--
2013-07-04T01:43:43.666022+04:00 c6ws4 kernel: [ 0.421559] pci :00:01.0: BAR 15: can't assign mem pref (size 0x1800)
2013-07-04T01:43:43.666024+04:00 c6ws4 kernel: [ 0.421563] pci :00:01.0: BAR 14: assigned [mem 0xe100-0xe1ff]
2013-07-04T01:43:43.666025+04:00 c6ws4 kernel: [ 0.421566] pci :00:16.1: BAR 0: assigned [mem 0xe0001000-0xe000100f 64bit]
2013-07-04T01:43:43.666026+04:00 c6ws4 kernel: [ 0.421576] pci :01:00.0: BAR 1: can't assign mem pref (size 0x1000)
2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421579] pci :01:00.0: BAR 3: can't assign mem pref (size 0x200)
2013-07-04T01:43:43.666027+04:00 c6ws4 kernel: [ 0.421581] pci :01:00.0: BAR 0: assigned [mem 0xe100-0xe1ff]
2013-07-04T01:43:43.666028+04:00 c6ws4 kernel: [ 0.421584] pci :01:00.0: BAR 6: can't assign mem pref (size 0x8)
2013-07-04T01:43:43.666029+04:00 c6ws4 kernel: [ 0.421586] pci :00:01.0: PCI bridge to [bus 01]
---
Maybe these are symptoms of a hardware/BIOS (Supermicro X9SCA-F, latest BIOS v.2.0b) error? I tried both BIOS modes - "above 4G Decoding" enabled and disabled. It looks to me like the NVIDIA driver uses BAR 1 (see below). Although there were also some messages in nvidia-installer.log that were unclear to me, the installer shows that the kernel interface of nvidia.ko was compiled; but then nvidia-installer.log contains:
--from nvidia-installer.log --
- Kernel module load error: No such device
- Kernel messages:
...[ 25.286079] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 1379.760532] nvidia: module license 'NVIDIA' taints kernel.
[ 1379.760536] Disabling lock debugging due to kernel taint
[ 1379.765158] nvidia :01:00.0: enabling device (0140 - 0142)
[ 1379.765165] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 1379.765165] NVRM: BAR1 is 0M @ 0x0 (PCI::01:00.0)
[ 1379.765166] NVRM: The system BIOS may have misconfigured your GPU.
[ 1379.765169] nvidia: probe of :01:00.0 failed with error -1
[ 1379.765177] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 1379.765178] NVRM: None of the NVIDIA graphics adapters were initialized!
-
I also add an lspci -v extraction:
01:00.0 3D controller: NVIDIA Corporation GK107 [Tesla K20c] (rev a1)
 Subsystem: NVIDIA Corporation Device 0982
 Flags: fast devsel, IRQ 11
 Memory at e100 (32-bit, non-prefetchable) [disabled] [size=16M]
 Memory at unassigned (64-bit, prefetchable) [disabled]
 Memory at unassigned (64-bit, prefetchable) [disabled]
Do the kernel messages above mean that I have hardware/BIOS problems, or may it be some NVIDIA driver problem? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange resume statements generated for GRUB2
Skylar Thompson skylar.thomp...@gmail.com wrote:
> Hibernation isn't strictly suspension - it's writing all allocated, non-file-backed portions of memory to the paging/swap space. When the system comes out of hibernation, it boots normally and then looks for a hibernation image in the paging space. If it finds one, it loads that back into system memory rather than proceeding with a regular boot. This is in contrast to system suspension, which depends on hardware support to place CPU, memory, and other system devices into a low power state, and wait for a signal to power things back up, bypassing the boot process.

Taking into account the small size of my swap partition (4 GB only, less than my RAM size - I wrote about this in my 1st message), the hibernation image may not fit into the swap partition. Therefore coding -part2 (for /) in the resume statement seemed preferable to me (right for the general case).

> I'm not a SuSE expert so I'm not sure what YaST is doing, but I imagine you have to make grub changes via YaST rather than editing the grub configs directly. Skylar

Generally speaking, you are right. But I strongly prefer to know what occurs at the Linux level - to have the natural possibility (enough knowledge) to work with OpenSUSE, Fedora etc. So I prefer to change the GRUB2 configuration files :-) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
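For reference, a hedged sketch of how the resume= parameter is normally handled on openSUSE 12.3 when editing the GRUB2 files directly (device names are examples; if resume points at a partition without a usable image the kernel just logs "Image not found" and continues booting, and resume handling can be skipped entirely with noresume):

# /etc/default/grub is the file grub2-mkconfig reads, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="resume=/dev/disk/by-id/ata-WDC_...-part1 splash=silent quiet"
# To drop resume handling altogether (e.g. swap smaller than RAM), one could use:
#   GRUB_CMDLINE_LINUX_DEFAULT="noresume splash=silent quiet"
# After editing, regenerate the real configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg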
[Beowulf] Prevention of cpu frequency changes in cluster nodes (Was : cpupower, acpid cpufreq)
I have installed OpenSuSE 12.3/x86-64 now. I can now explain the reasons why I am afraid that cpufreq modules are being loaded.
1) I found in /var/log/messages pairs of strings about governors, like
[kernel] cpuidle: using governor ladder
[kernel] cpuidle: using governor menu
and, strange to me,
[kernel] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[kernel] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
2) The presence on the installed system of the directories /sys/devices/system/cpu/cpufreq and /sys/devices/system/cpu/cpu0/cpuidle. The cpuidle directory contains state0, state1 etc. directories with non-empty files.
3) But to prevent CPU frequency changes I suppressed all such possibilities in the BIOS.
4) And I don't have (as I wrote in my previous Beowulf message) /sys/devices/system/cpu/cpu0/cpufreq files. Just the presence of this file is used by my /etc/init.d/cpufreq script as the test of whether the cpufreq kernel modules need to be loaded.
5) lsmod says that no cpufreq modules are loaded.
Any comments? Am I right everywhere here, and should I ignore my fears about the kernel messages and the presence of some /sys/devices/system/cpu/... files? (A few quick checks are sketched after this message.) Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
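A few read-only checks that should confirm the situation described above (nothing here changes system state; note that cpuidle/C-states are separate from cpufreq/P-states, so the "governor ladder/menu" messages by themselves do not imply frequency scaling):

# Any frequency-scaling driver loaded as a module?
lsmod | egrep 'cpufreq|powernow'
# Is a scaling driver bound to the CPUs at all?
ls /sys/devices/system/cpu/cpu0/cpufreq 2>/dev/null \
  || echo "no cpufreq interface -> no scaling driver active"
# cpuidle (C-states) is independent of cpufreq and does not change the clock:
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name 2>/dev/null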
[Beowulf] Strange resume statements generated for GRUB2
I have swap in the sda1 partition and / in the sda2 partition of the HDD. At installation of OpenSUSE 12.3 (where YaST2 is used) on my cluster node I found what are, in my opinion, erroneous boot loader (GRUB2) settings. YaST2 proposed (at installation) to use ... resume=/dev/disk/by-id/ata-WDC-... -part1 splash=silent ... in the configuration of GRUB2. These parameters are passed (at Linux boot) by GRUB2 to the Linux kernel. GRUB2 itself, according to my installation settings, was installed to the MBR. I changed (at the installation stage) -part1 to -part2, but after that YaST2 restored it back to the -part1 value! And after installation OpenSuSE boots successfully! I found (in the installed OpenSuSE) 2 GRUB2 configuration files with the, to me, erroneous -part1 setting. I found a possible interpretation of this behaviour in /var/log/messages, which contains the strings:
[Kernel] PM: Checking hibernation image partition /dev/disk/by-id/ata-WDC_...-part1
[Kernel] PM: Hibernation Image partition 8:1 present
[Kernel] PM: Looking for hibernation image.
[Kernel] PM: Image not found (code -22)
[Kernel] PM: Hibernation Image partitions not present or could not be loaded
What does this mean? Is the hibernation image written to the swap partition? But I believe that hibernation is really suppressed in my Linux (the cpufreq kernel modules are not loaded), and my BIOS settings do not allow any changes of CPU frequency. BTW, my swap partition is small (4 GB, but the RAM size is 8 GB). Which GRUB2/resume settings are really right, and why are they right? Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] cpupower, acpid cpufreq
I plan to perform a Linux installation (openSuSE 12.3/x86-64, kernel 3.7.10) for an HPC cluster. I do not want to change CPU frequencies in the cluster nodes, and do not want to use the cpufreq kernel modules. Therefore I also don't want to use the special power-saving states of the CPUs. I performed a quick test installation and, of course, the /lib/modules/`uname -r`/kernel/drivers/cpufreq directory is present, but no cpufreq kernel modules are loaded, and /sys/devices/system/cpu/cpu0 etc. do not have cpufreq files. But generally - must I perform special steps to avoid loading of the cpufreq modules? BTW, does somebody use the acpid and/or cpupower RPM packages on cluster nodes? If yes, why are they interesting? Mikhail Kuzminsky Computer Assistance to Chemical Research Center RAS, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] nVidia Kepler GK110 GPU is incompatible w/Intel x86 hardware in PCI-E 3.0 mode ?
I have a cluster node (with Linux, of course) based on a Supermicro X9SCA system board and a Xeon E3-1230v2 with the LGA1155 socket. Now I want to buy an Nvidia Kepler GK110 GPU with PCI-E 3.0 (the K20 compute board from PNY?) and install it in my node. The Intel Xeon E3-1230v2 and the Supermicro X9SCA both support PCI-E 3.0. But I have heard that GK110 (as available today on the market) can't work in PCI-E 3.0 mode with Intel equipment (only in PCI-E 2.0 mode), while GK104 etc., good for SP only, work with PCI-E 3.0 normally. Moreover, for successful work of GK110 with an Intel Xeon platform I would need: a) to buy the next version of the Xeon processor, which will have a new socket (when will it arrive on the market?) and new PCI-E 3.0 support, plus a new system board for this processor (i.e. I need to modernize the node hardware), and b) to buy a new GK110 version, which will have an improved PCI-E 3.0 interface block (again, when will it be on the market?). The reason, as I heard, is a different implementation of the PCI-E 3.0 standard by nVidia and by Intel in the currently available hardware - a result of the PCI-E 3.0 standard being specified in too little detail; it should be defined more exactly. Can somebody clarify this situation? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Quantum Chemistry scalability for large number of processors (cores)
Thu, 27 Sep 2012 11:11:24 +1000, from Christopher Samuel sam...@unimelb.edu.au:
> On 27/09/12 03:52, Andrew Holway wrote:
>> Let the benchmarks begin!!!
> Assuming the license agreement allows you to publish them.. :-)

For example: the Gaussian-09/03/... licenses disallow publishing any data which may harm Gaussian, Inc. Therefore if you present speedup values which show good parallelization efficiency, and without any comparison with other programs, all will be OK. There are also some other codes which are free, even with a GPL license. But I myself don't have a high number of cores :-) Mikhail

> -- Christopher Samuel, Senior Systems Administrator, VLSCI - Victorian Life Sciences Computation Initiative, Email: sam...@unimelb.edu.au, Phone: +61 (0)3 903 55545, http://www.vlsci.org.au/ http://twitter.com/vlsci

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Q: IB message rate large core counts (per node) ?
BTW, is Cray SeaStar2+ better than IB for nodes with many cores? And I haven't seen a latency comparison for SeaStar vs. IB. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Fortran Array size question
In message from Prentice Bisbal prent...@ias.edu (Tue, 03 Nov 2009 12:09:07 -0500):
> This question is a bit off-topic, but since it involves Fortran minutia, I figured this would be the best place to ask. This code may eventually run on my cluster, so it's not completely off topic! Question: What is the maximum number of elements you can have in a double-precision array in Fortran? I have someone creating a 4-dimensional double-precision array. When they increase the dimensions of the array to ~200 million elements, they get this error: compilation aborted (code 1). I'm sure they're hitting a Fortran limit, but I need to prove it. I haven't been able to find anything using The Google.

It is not a Fortran restriction. It may be some compiler restriction. The 64-bit ifort for EM64T allows you to use, for example, 400 million elements. (A sketch of where the usual limit appears follows this message.) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow

> -- Prentice

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
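A hedged illustration of where such a "compilation aborted" usually comes from with ifort on x86-64 (an assumption about Prentice's case, not a confirmed diagnosis): a *static* array pushing the static data past the 2 GB small-memory-model limit, which needs the medium memory model, whereas an ALLOCATABLE array avoids the limit entirely.

! ~400 million double precision elements (~3.2 GB) as a static array.
program bigarray
  implicit none
  integer, parameter :: n1=200, n2=200, n3=100, n4=100   ! 4.0e8 elements
  double precision, save :: a(n1,n2,n3,n4)               ! static storage
  a(1,1,1,1) = 1.0d0
  print *, a(1,1,1,1)
end program bigarray
! Static data above 2 GB typically needs:
!   ifort -mcmodel=medium -shared-intel bigarray.f90
! Declaring the array ALLOCATABLE (heap) sidesteps the static-data limit.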
Re: [Beowulf] nearly future of Larrabee
In message from Bogdan Costescu bcoste...@gmail.com (Sun, 23 Aug 2009 03:17:08 +0200):
> 2009/8/21 Mikhail Kuzminsky k...@free.net:
>> Q3. Does it mean that Larrabee will give an essential speedup also on relatively short vectors?
> I don't quite understand your question...

For example, will DAXPY give an essential speedup (a good percentage of peak performance) for N = 10 or 100, for example, and will DGEMM give high performance for medium sizes of matrices - or will we need large N values, for example 1000 or more? (The loop in question is sketched after this message.) As for gather/scatter etc. for vector processing, the compilers for the Cray T90/C90 ... Cray 1, NEC SX-6/5/4 ... performed, I believe, all the necessary things. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
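The kernel behind the question, just to make it concrete - DAXPY in plain Fortran. The issue is whether a wide SIMD unit still pays off when n is only 10 or 100 and loop startup and remainder handling dominate:

! DAXPY: y := y + a*x, the short-vector case in question.
subroutine daxpy_sketch(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)    :: a, x(n)
  double precision, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do
end subroutine daxpy_sketch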
[Beowulf] nearly future of Larrabee
AFAIK Larrabee-based product(s) will appear soon - at the beginning of 2010. Unfortunately I didn't see enough appropriate technical information. What new is known from SIGGRAPH 2009? There were 2 ideas for Larrabee-based hardware: a) whole computers on Larrabee CPU(s), b) a GPGPU card. Recently I didn't see any words about Larrabee-based servers - only about graphics cards. If Larrabee will work as a CPU, then I believe that the Linux kernel developers will work in this direction. But I didn't find anything about Larrabee in 2.6. So:
Q1. Are there plans to build Larrabee-based motherboards (in particular in 2010)?
If Larrabee will be in the form of a graphics card (the most probable case) -
Q2. What will the interface be - one PCI-E v.2 x16 slot?
It is known now that DP will be supported in hardware and (AFAIK) that 512-bit operands (i.e. 8 DP words) will be supported in the ISA.
Q3. Does this mean that Larrabee will give an essential speedup also on relatively short vectors? And are there some preliminary articles with an estimation of Larrabee DP performance?
One of the declared potential advantages of Larrabee is support by compilers. There is now PGI Fortran with NVidia GPGPU extensions. PGI Accelerator-2010 will include support of CUDA on the base of OpenMP-like comments to the compiler. So:
Q4. Are there some rumours about direct Larrabee support in the Intel ifort or PGI compilers in 2010? (By "direct" I mean automatic compiler vectorization of pure Fortran/C source, at most with additional comments.)
Q5. How much may Larrabee-based hardware cost in 2010? I hope it'll be lower $1. Any more exact predictions?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] moving of Linux HDD to other node: udev problem at boot
In message from Reuti re...@staff.uni-marburg.de (Wed, 19 Aug 2009 21:07:19 +0200):
> Maybe the disk id is different from the one recorded in /etc/fstab. What about using plain /dev/sda1 or alike, or mounting by volume label?

At the moment of the problem /etc/fstab, as I understand, isn't used yet. And the /dev/sda* files are not created by udev :-( Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Re: moving of Linux HDD to other node: udev problem at boot
In message from David Mathog mat...@caltech.edu (Thu, 20 Aug 2009 11:29:17 -0700):
> Mikhail Kuzminsky k...@free.net wrote:
>> I moved a Western Digital SATA HDD with SuSE 10.3 installed (on a dual Barcelona server) to a dual Nehalem server (as master HDD on the Nehalem server) with a Supermicro X8DTi mobo.
> Which means any number of drivers will have to change. The boot could only succeed if all of these new drivers are present in the distro AND the installation isn't hardwired to use information from the previous system. The first may be true,

That was, of course, the main hope,

> the second is almost certainly false. ...

... and the second can, in my opinion, be resolved without difficult problems.

> On Mandriva, and probably Red Hat, and maybe Suse, even cloning between identical systems requires that the file /etc/udev/rules.d/61-net_config.rules be removed before reboot as it holds a copy of the MAC from the previous system, and no two machines (should) have the same MAC even if they are otherwise identical.

SuSE has this problem, but at least 11.1 has a special setting to avoid such udev behaviour. And updating the network settings isn't a problem.

> There are a lot of other files in the same directory which I believe hold similar machine specific information. Similarly, your /etc/modprobe.conf will almost certainly load modules which are not appropriate for the new system.

Are there any modules which depend on the processors? The NIC drivers aren't a problem.

> If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly be incorrect.

Of course, the lm_sensors and NIC settings have to be changed. But the HDDs, for example, were the same (excluding size).

> Perhaps you can successfully boot the system in safe mode and then run whatever configuration tool Suse provides to reset all of these hardware specific files?

The problem doesn't depend on the kind of boot (safe mode or usual).

> David Mathog mat...@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech

___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] moving of Linux HDD to other node: udev problem at boot
In message from Greg Lindahl lind...@pbm.com (Thu, 20 Aug 2009 11:23:25 -0700):
> On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote:
>>> AFAIK, initrd (as the kernel itself) is universal for EM64T/x86-64,
>> The problem is not the type of CPU, but the chipset (i.e. the necessary kernel module) with which the HDD is accessed.
> There are 2 aspects to this: 1: /etc/modprobe.conf or equivalent; 2: the initrd on a non-rescue disk is generally specialized to only include modules for devices in (1). Solution? Boot a rescue disk, chroot to your system disk, modify /etc/modprobe.conf appropriately, run mkinitrd.

Thanks, it's a good idea! The problem is (I think) just in the 10.3 initrd image. Unfortunately it is somewhat at odds with my original hope - to move the HDD ASAP (As Simple As Possible :-)). (A sketch of that rescue route follows this message.) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
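A hedged sketch of that rescue route in openSUSE 10.x terms (partition, module and file names are examples only; on SuSE the initrd module list lives in INITRD_MODULES in /etc/sysconfig/kernel rather than in modprobe.conf, and the right SATA driver for the new board has to be verified, e.g. with lspci -k - it is likely but not certainly ahci):

# From the rescue system:
mount /dev/sda2 /mnt               # root partition of the moved disk (example)
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt
# Add the disk driver to INITRD_MODULES in /etc/sysconfig/kernel, e.g.:
#   INITRD_MODULES="ahci processor thermal fan"
mkinitrd                           # rebuild /boot/initrd for the new hardware
exit
reboot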
[Beowulf] moving of Linux HDD to other node: udev problem at boot
As was discussed here, there are NUMA problems with Nehalem on a set of Linux distributions/kernels. I was informed that the old OpenSuSE 10.3 default kernel (2.6.22) may work with Nehalem OK in the NUMA sense, i.e. give the right /sys/devices/system/node content. I moved a Western Digital SATA HDD with SuSE 10.3 installed (on a dual Barcelona server) to a dual Nehalem server (as master HDD on the Nehalem server) with a Supermicro X8DTi mobo. But loading of SuSE 10.3 on the Nehalem server was not successful. The grub loader (whose menu.lst configuration uses by-id identification of disk partitions) works OK. But the Linux kernel boot did not finish successfully: the /boot/04-udev.sh script (whose task is udev initialization) - I think it's from the initrd content - does not see the root partition (the 1st partition on the HDD)! At boot I see the messages:
SCSI subsystem initialized
ACPI Exception (processor_core_0787): Processor device isn't present
(a set of messages about usb) ...
Trying manual resume from /dev/sda2   /* it's the swap partition */
resume device /dev/sda2 not found (ignoring)
...
Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WDname_of_disk-part1 ...   /* echo from udev.sh */
and then the proposal to try again. After this script finishes I don't see any HDDs in /dev. The BIOS setting for this SATA device is "enhanced"; "compatible" mode gives the same result. What may be the source of the problem? Maybe the HDD driver used by the initrd? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow PS. If I look (after the udev.sh script finishes) at the content of /sys, it is right in the NUMA sense, i.e. /sys/devices/system/node contains the normal node0 and node1. ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Thu, 13 Aug 2009 17:09:24 -0700): Tom Elken wrote: To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: Thanks for the heads up, I've checked the specbench.org compiler options for hints on where to start with optimization flags, but I didn't know about the dynamic stream. Is the HPC challenge code open source? Yes, they are open. PathScale 2.2.1 on Opteron: Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Alas my pathscale license expired and I believe with sci-cortex's death (RIP) I can't renew it. Now I understand that I was sage :-) (we purchased perpetual acafemic license). #1042;#1058;W, do somebody know about Pathscale compilers future (if it will be) ? Mikhail I tried open64-4.2.2 with those flags and on a nehalem single socket: $ opencc -O4 -fopenmp stream.c -o stream-open64 -static $ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static $ ./stream-open64 Total memory required = 457.8 MB. Function Rate (MB/s) Avg time Min time Max time Copy: 22061.4958 0.0145 0.0145 0.0146 Scale: 8.4705 0.0144 0.0144 0.0145 Add:20659.2638 0.0233 0.0232 0.0233 Triad: 20511.0888 0.0235 0.0234 0.0235 Dynamic: $ ./stream-open64-malloc Function Rate (MB/s) Avg time Min time Max time Copy: 14436.5155 0.0222 0.0222 0.0222 Scale: 14667.4821 0.0218 0.0218 0.0219 Add:15739.7070 0.0305 0.0305 0.0305 Triad: 15770.7775 0.0305 0.0304 0.0305 Intel C/C++ Compiler 10.1 on Harpertown CPUs: Base OPT flags: -O2 -xT -ansi-alias -ip -i-static Intel recently used Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: -O2 -xSSE4.2 -ansi-alias -ip and got good STREAM results in their HPCC submission on their ENdeavor cluster. $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc $ ./stream-icc | grep : STREAM version $Revision: 5.9 $ Copy: 14767.0512 0.0022 0.0022 0.0022 Scale: 14304.3513 0.0022 0.0022 0.0023 Add:15503.3568 0.0031 0.0031 0.0031 Triad: 15613.9749 0.0031 0.0031 0.0031 $ ./stream-icc-malloc | grep : STREAM version $Revision: 5.9 $ Copy: 14604.7582 0.0022 0.0022 0.0022 Scale: 14480.2814 0.0022 0.0022 0.0022 Add:15414.3321 0.0031 0.0031 0.0031 Triad: 15738.4765 0.0031 0.0030 0.0031 So ICC does manage zero penalty, alas no faster than open64 with the penalty. I'll attempt to track down the HPCC stream source code to see if their dynamic arrays are any friendlier than mine (I just use malloc). In any case many thanks for the pointer. 
Oh, my dynamic tweak:
$ diff stream.c stream-malloc.c
43a44
> # include <stdlib.h>
97c98
< static double a[N+OFFSET],
---
> /* static double a[N+OFFSET],
99c100,102
< c[N+OFFSET];
---
> c[N+OFFSET]; */
> double *a, *b, *c;
134a138,142
> a=(double *)malloc(sizeof(double)*(N+OFFSET));
> b=(double *)malloc(sizeof(double)*(N+OFFSET));
> c=(double *)malloc(sizeof(double)*(N+OFFSET));
283c291,293
---
> free(a); free(b); free(c);
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
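For readers who want to reproduce the dynamic-array variant without applying the diff by hand, here is a minimal self-contained sketch of the same idea (this is not the HPCC code itself; the array length N and the single timed Triad pass are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define N      20000000   /* 3 arrays x 8 bytes x 20M elements = ~457 MB, as in the run above */
#define OFFSET 0

int main(void)
{
    /* heap-allocated arrays instead of the static ones in the stock stream.c */
    double *a = malloc(sizeof(double) * (N + OFFSET));
    double *b = malloc(sizeof(double) * (N + OFFSET));
    double *c = malloc(sizeof(double) * (N + OFFSET));
    if (!a || !b || !c) { fprintf(stderr, "malloc failed\n"); return 1; }

    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    /* the STREAM Triad kernel; the real benchmark times this loop over several repetitions */
    const double scalar = 3.0;
    for (long j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];

    printf("a[0] = %f\n", a[0]);   /* keep the compiler from throwing the loop away */
    free(a); free(b); free(c);
    return 0;
}

Whether the compiler can still use non-temporal stores and avoid the "dynamic penalty" seen above depends on it proving that the three pointers don't alias - which is exactly why the static-array version is easier to optimize.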
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Thu, 13 Aug 2009 17:09:24 -0700):
Do I understand correctly that these results are for 4 cores / 4 OpenMP threads? And what is the DDR3 RAM: DDR3/1066?
Mikhail
I tried open64-4.2.2 with those flags and on a nehalem single socket: [...]
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Tom Elken tom.el...@qlogic.com (Fri, 14 Aug 2009 13:57:53 -0700): On Behalf Of Bill Broadley: I put DDR3-1333 in the machine, but the bios seems to want to run them at 1066.
How many dimms per memory channel do you have? My understanding (which may be a few months old) is that if you have more than one dimm per memory channel, DDR3-1333 dimms will run at 1066 speed; i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory channel. I'm not sure exactly what speed they are running at. Your results look excellent, so I wouldn't be surprised if they are running at 1333.
I have 12-18 GB/s on 4 threads of STREAM/ifort w/DDR3-1066 on a dual E5520 server. But it works under a NUMA-bad kernel, w/o control of NUMA-efficient allocation.
Mikhail
-Tom
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] bizarre scaling behavior on a Nehalem
In message from Bill Broadley b...@cse.ucdavis.edu (Fri, 14 Aug 2009 16:13:21 -0700): Mikhail Kuzminsky wrote: Your results look excellent, so I wouldn't be surprised if they are running at 1333. I have 12-18 GB/s on 4 threads of STREAM/ifort w/DDR3-1066 on a dual E5520 server. But it works under a NUMA-bad kernel w/o control of NUMA-efficient allocation.
Sounds pretty bad. Why 4 threads? You need 8 cores to keep all 6 memory busses busy.
For comparison w/your tests: you have only 4 cores. On 8 threads I have 20-26 GB/s.
Which compiler?
The ifort mentioned above means Intel Fortran 11.0.38.
Mikhail
open64 does substantially better than gcc.
___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In message from Rahul Nabar rpna...@gmail.com (Sun, 9 Aug 2009 22:42:25 -0500): (a) I am seeing strange scaling behaviours with Nehalem cores. eg A specific DFT (Density Functional Theory) code we use is maxing out performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are actually slower than 2 and 4 cores (depending on setup)
If these results are for HyperThreading ON, it may be not too strange, because of competition between the virtual cores. But if these results are with HyperThreading switched off - it's strange. I usually have good DFT scaling w/the number of cores on G03 - about a factor of 7 on 8 cores.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
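A quick way to tell whether the 8-core slowdown is the HyperThreading case is to compare logical CPUs with physical cores: if "siblings" is twice "cpu cores", SMT is on and 8 threads are really sharing 4 physical cores. These are standard /proc/cpuinfo fields, nothing Nehalem-specific:
$ grep -c "^processor" /proc/cpuinfo    # logical CPUs the kernel sees
$ grep -m1 "cpu cores" /proc/cpuinfo    # physical cores per package
$ grep -m1 "siblings" /proc/cpuinfo     # logical CPUs per package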
Re: [Beowulf] numactl SuSE11.1
It's interesting that for this hardware/software configuration, disabling NUMA in the BIOS gives higher STREAM results than with NUMA enabled. I.e. for NUMA off: 8723/8232/10388/10317 MB/s; for NUMA on: 5620/5217/6795/6767 MB/s (both for OMP_NUM_THREADS=1 and the ifort 11.1 compiler). The situation for Opterons is the opposite: NUMA mode gives higher throughput.
In message from Mikhail Kuzminsky k...@free.net (Mon, 10 Aug 2009 21:43:56 +0400): I'm sorry for my mistake: the problem is on a Nehalem Xeon under SuSE 11.1, but w/kernel 2.6.27.7-9 (w/a Supermicro X8DT mobo). [...]
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
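When chasing this kind of wrong node numbering, it helps to look at the topology the kernel actually built and compare it with the BIOS SRAT it parsed at boot; numactl, sysfs and dmesg should all agree. These are standard commands, nothing distribution-specific:
$ numactl --hardware                 # nodes, their CPUs and memory sizes, and the distance matrix
$ ls /sys/devices/system/node/       # should show node0 and node1 on a two-socket Nehalem box
$ dmesg | grep -i -e SRAT -e NUMA    # how the BIOS SRAT table was parsed at boot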
Re: [Beowulf] numactl SuSE11.1
I'm sorry for my mistake: the problem is on a Nehalem Xeon under SuSE 11.1, but w/kernel 2.6.27.7-9 (w/a Supermicro X8DT mobo). For the Opteron 2350 w/SuSE 10.3 (w/the older 2.6.22.5-31 - I erroneously inserted that string in my previous message) numactl works OK (w/a Tyan mobo). NUMA is enabled in the BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) are set to y in both kernels. Unfortunately I (i.e. root) can't change the files in /sys/devices/system/node (or rename the directory node2 to node1) :-( - as is possible w/some files in the /proc filesystem. It's interesting that the extract from dmesg shows that IT WAS node 1, but then node 2 appears!
ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL1)
ACPI: SSDT BF79FAC0, 249F (r1 DpgPmmCpuPm 12 INTL 20051117)
ACPI: Local APIC address 0xfee0
SRAT: PXM 0 - APIC 0 - Node 0
SRAT: PXM 0 - APIC 2 - Node 0
SRAT: PXM 0 - APIC 4 - Node 0
SRAT: PXM 0 - APIC 6 - Node 0
SRAT: PXM 1 - APIC 16 - Node 1
SRAT: PXM 1 - APIC 18 - Node 1
SRAT: PXM 1 - APIC 20 - Node 1
SRAT: PXM 1 - APIC 22 - Node 1
SRAT: Node 0 PXM 0 0-a
SRAT: Node 0 PXM 0 10-c000
SRAT: Node 0 PXM 0 1-1c000
SRAT: Node 2 PXM 257 1c000-34000 (here !!)
NUMA: Allocated memnodemap from 1c000 - 22880
NUMA: Using 20 for the hash shift.
Bootmem setup node 0 -0001c000
NODE_DATA [00022880 - 0003a87f]
bootmap [0003b000 - 00072fff] pages 38
(8 early reservations) == bootmem [00 - 01c000]
#0 [00 - 001000] BIOS data page == [00 - 001000]
#1 [006000 - 008000] TRAMPOLINE == [006000 - 008000]
#2 [20 - bf27b8] TEXT DATA BSS == [20 - bf27b8]
#3 [0037a3b000 - 0037fef104] RAMDISK == [0037a3b000 - 0037fef104]
#4 [09cc00 - 10] BIOS reserved == [09cc00 - 10]
#5 [01 - 013000] PGTABLE == [01 - 013000]
#6 [013000 - 01c000] PGTABLE == [013000 - 01c000]
#7 [01c000 - 022880] MEMNODEMAP == [01c000 - 022880]
Bootmem setup node 2 0001c000-00034000
NODE_DATA [0001c000 - 0001c0017fff]
bootmap [0001c0018000 - 0001c0047fff] pages 30
(8 early reservations) == bootmem [01c000 - 034000]
#0 [00 - 001000] BIOS data page
#1 [006000 - 008000] TRAMPOLINE
#2 [20 - bf27b8] TEXT DATA BSS
#3 [0037a3b000 - 0037fef104] RAMDISK
#4 [09cc00 - 10] BIOS reserved
#5 [01 - 013000] PGTABLE
#6 [013000 - 01c000] PGTABLE
#7 [01c000 - 022880] MEMNODEMAP
found SMP MP-table at [880ff780] 000ff780
[e200-e20006ff] PMD - [88002820-88002e1f] on node 0
[e2000700-e2000cff] PMD - [8801c020-8801c61f] on node 2
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Tyan S7002 for Nehalem-based nodes
Are there any contraindications to using the Tyan S7002 AG2NR w/Xeon 5520 for cluster nodes? Maybe somebody has some experience w/the S7002? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] FPU performance of Intel CPUs
In message from John Hearns hear...@googlemail.com (Mon, 6 Apr 2009 17:45:37 +0100): 2009/4/6 Jones de Andrade johanne...@gmail.com: That's a thing that raises a question for me... Will beowulfers start to accept manufacturers' auto-overclock as a feature... or will you choose motherboards that allow you to disable this? ;) Concerning Nehalems, of course. I read up about this. You can always disable it using ACPI.
If you run a well-parallelized program w/high CPU utilization, I believe you SHOULD disable turbo-boost mode :-)
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] X5500
In message from Kilian CAVALOTTI kilian.cavalotti.w...@gmail.com (Tue, 31 Mar 2009 10:27:55 +0200): ... Any other numbers, people?
I believe there are also some other important numbers - the prices for Xeon 55XX and system boards ;-) I didn't see prices on pricegrabber, for example. Is there some price information available?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Lowered latency with multi-rail IB?
In message from Dow Hurst DPHURST dphu...@uncg.edu (Thu, 26 Mar 2009 23:32:23 -0400): We've got a couple of weeks max to finalize spec'ing a new cluster. Has anyone knowledge of lowering latency for NAMD by implementing a multi-rail IB solution using MVAPICH or Intel's MPI? My research tells me low latency is key to scaling our code of choice, NAMD, effectively. Has anyone cut down real effective latency to below 1.0us using multi-rail IB for molecular dynamics codes such as Gromacs, Amber, CHARMM, or NAMD? What about lowered latency for parallel ab initio calculations involving NWChem, Jaguar, or Gaussian using multi-rail IB?
In contrast to molecular dynamics programs (Gromacs/Amber/Charmm), where low latency is necessary, some quantum chemical programs (Gaussian, Gamess-US) have a relatively low dependency on the interconnect. I measured message lengths for Gaussian-03 for a set of calculation methods, and these messages are middle-to-large in size. NWChem is the only quantum-chemical program I know of which requires high interconnect performance. I don't know about Jaguar.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow
If so, what was the configuration of cards and software? Any caveats involved, except price? ;-) Multi-rail IB is not something I know much about so am trying to get up to speed on what is possible and what is not. I do understand that lowering latency using multi-rail has to come from the MPI layer knowing how to use the hardware properly and some MPI implementations have such options and others don't. I understand that MVAPICH has some capabilities to use multi-rail and that NAMD is run on top of MVAPICH on many IB based clusters. Any links or pointers to how I can quickly educate myself on the topic would be appreciated. Best wishes, Dow __ Dow P. Hurst, Research Scientist Department of Chemistry and Biochemistry University of North Carolina at Greensboro 435 New Science Bldg. Greensboro, NC 27402-6170 dphu...@uncg.edu dow.hu...@mindspring.com 336-334-5122 office 336-334-4766 lab 336-334-5402 fax -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Sun X4600 STREAM results
Sorry, does somebody have X4600 M2 STREAM results (or the corresponding URLs) for DDR2/667 - as a function of the number of processor cores? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Grid scheduler for Windows XP
In message from Sangamesh B forum@gmail.com (Thu, 5 Mar 2009 01:29:09 -0500): Hello everyone, Is there a Grid scheduler (only open source, like SGE) tool which can be installed/run on Windows XP Desktop systems (there is strictly no Linux involvement)? The applications used under this grid are native to Windows XP.
The GRAM component of the Globus Toolkit (http://www.globus.org/) gives you some batch queue system capabilities, and there are SGE interfaces to Globus.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow
Thanks, Sangamesh ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] RE:small distro for PXE boot, autostarts sshd?
In message from Greg Keller g...@keller.net (Fri, 27 Feb 2009 10:20:50 -0600): Have you ever considered Perceus (Caos has it baked in) from infiscale? ... http://www.infiscale.com/
It looks like there is only one way to understand in more detail what Perceus does - to download it :-) What is known about OpenSuSE/SLES distros + Perceus? (Or any other choice for SuSE distros.)
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] How many double precision floating point operations per clock cycle for AMD Barcelona?
In message from Prakashan Korambath p...@ats.ucla.edu (Tue, 10 Feb 2009 08:23:05 -0800): Could someone confirm the number of double precision floating point operations (FLOPS) for AMD Barcelona chips? The URL below seems to indicate 4 FLOPS per cycle. I just want to confirm it. Thanks.
Yes - 4 double precision FLOPs per clock cycle, per core.
Mikhail Kuzminskiy, Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry RAS, Moscow
http://forums.amd.com/devblog/blogpost.cfm?catid=253threadid=87799 Prakashan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
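For anyone who wants the chip- or node-level number rather than the per-core one, the peak simply multiplies out (a paper figure only - real codes sustain much less):
peak DP FLOPS = (FLOPs per cycle per core) x (clock) x (number of cores)
one Barcelona core  @ 2.0 GHz: 4 x 2.0 GHz      =  8 GFLOPS
one quad-core chip  @ 2.0 GHz: 4 x 2.0 GHz x 4  = 32 GFLOPS
dual-socket node    @ 2.0 GHz: 4 x 2.0 GHz x 8  = 64 GFLOPS
which is where the 64 GFLOPS figure quoted later in this digest for dual Opteron 2350 nodes comes from.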
Re: [Beowulf] Hadoop
In message from Gerry Creager gerry.crea...@tamu.edu (Mon, 29 Dec 2008 09:01:21 -0600): As for Fortran vs C, real scientists program in Fortran. Real Old Scientists program in Fortran-66. Carbon-dated scientists can still recall IBM FORTRAN-G and -H. :-) I didn't check, but may be I just have Fortran-G and H on my PC - as a part of free Turnkey MVS distribution working w/(free) Hercules emulator for IBM mainframes. Actually, a number of our mathematicians use C for their codes, but don't seem to be doing much more than theoretical codes. The guys who're wwriting/rewriting practical codes (weather models, computational chemistry, reservoir simulations in solid earth) seem to stick to Fortran here. Our group works in area of computational chemistry, and of course we write the programs on Fortran (95) :-) But I'm afraid that we'll start here the new cycle of religious language war :-) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow gerry Jeff Layton wrote: I hate to tangent (hijack?) this subject, but I'm curious about your class poll. Did the people who were interested in Matlab consider Octave? Thanks! Jeff *From:* Joe Landman land...@scalableinformatics.com *To:* Jeff Layton layto...@att.net *Cc:* Gerry Creager gerry.crea...@tamu.edu; Beowulf Mailing List beowulf@beowulf.org *Sent:* Saturday, December 27, 2008 11:11:20 AM *Subject:* Re: [Beowulf] Hadoop N.B. the recent MPI class we gave suggested that we need to re-tool it to focus more upon Fortran than C. There was no interest in Java from the class I polled. Some researchers want to use Matlab for their work, but most university computing facilities are loathe to spend the money to get site licenses for Matlab. Unfortunate, as Matlab is a very cool tool (been playing with it first in 1988 ...) its just not fast. The folks at Interactive Supercomputing might be able to help with this with their compiler. -- Gerry Creager -- gerry.crea...@tamu.edu Texas Mesonet -- AATLT, Texas AM University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Parallel software for chemists
In message from Dr Cool Santa drcoolsa...@gmail.com (Wed, 10 Dec 2008 19:21:43 +0530): Currently in the lab we use Schrodinger and we are looking into NWChem. We'd be interested in knowing about software that a chemist could use that makes use of a parallel supercomputer. And better if it is linux.
To put it shortly, practically all modern software for molecular modelling calculations can run in parallel on Linux clusters.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Clos network vs fat tree
Sorry, is it correct to say that a fat tree topology is equal to a *NON-BLOCKING* Clos network w/the addition of uplinks? I.e. does any non-blocking Clos network, w/the corresponding addition of uplinks, give a fat tree? I read somewhere that the exact proof of the non-blocking property was done for Clos networks with >= 3 levels. But the most popular Infiniband fat trees have only 2 levels. (Yes, I know that non-blocking for a Clos network isn't absolute :-)) Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
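For reference, the classical non-blocking result is stated for the 3-stage Clos network: with n inputs per ingress switch and m middle-stage switches,
  strictly non-blocking:       m >= 2n - 1   (Clos, 1953)
  rearrangeably non-blocking:  m >= n
A 2-level IB fat tree is the folded form of that 3-stage network - each leaf switch plays both the ingress and the egress role, and the spine is the middle stage - which is why such fabrics are usually called non-blocking when every leaf has as many uplinks as host-facing ports. This is only a summary of the textbook result, not a proof for deeper multi-level trees.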
Re: Re[2]: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Jan Heichler [EMAIL PROTECTED] (Wed, 22 Oct 2008 20:27:40 +0200): Hello Mikhail, on Wednesday, 22 October 2008, you wrote:
MK In message from Ivan Oleynik [EMAIL PROTECTED] (Tue, 21 Oct 2008 MK 18:15:49 -0400): I have heard that AMD Shanghai will be available in Nov 2008. Does someone know the pricing and performance info and how it compares with Barcelona? Are there some informal comparisons of Shanghai vs Nehalem?
MK I believe that the Shanghai performance increase in comparison w/Barcelona MK will in practice be defined only by possibly higher Shanghai MK frequencies.
You can expect to see better performance in SPEC_CPU for Shanghai vs. Barcelona when comparing identical clockspeeds. But of course the increased clockspeed is a big argument for Shanghai (or the same clockspeed with less energy consumption). And Shanghai has some more features like faster memory and HT3 in some of the later revisions, I hope...
Yes, I think HT3 *must* be there. It was declared for Barcelona, but AFAIK it is really supported now only for desktop chips.
Mikhail
Jan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Ivan Oleynik [EMAIL PROTECTED] (Tue, 21 Oct 2008 18:15:49 -0400): I have heard that AMD Shanghai will be available in Nov 2008. Does someone know the pricing and performance info and how it compares with Barcelona? Are there some informal comparisons of Shanghai vs Nehalem?
I believe that the Shanghai performance increase in comparison w/Barcelona will in practice be defined only by possibly higher Shanghai frequencies.
Mikhail
Thanks, Ivan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Shanghai vs Barcelona, Shanghai vs Nehalem
In message from Mark Hahn [EMAIL PROTECTED] (Wed, 22 Oct 2008 13:23:08 -0400 (EDT)): Are there some informal comparisons of Shanghai vs Nehalem? I beleive that Shanghai performance increase in comparison w/Barcelona will be practically defined only by possible higher Shanghai frequencies. is that based on anything hands-on? No, I'm not under NDA - because I don't have Shanghai chips in hands :-) Mikhail IMO, AMD needs to get a bit more serious about competing. if I7 ships with ~15 GB/s per socket and working multi-socket scalability, it's hard to imagine why anyone would bother to look at AMD. either: - there is some sort of significant flaw with I7 (runs like a dog in 64b mode or Hf turns into blue cheese after a year, etc). - AMD gets its act together (lower-latency L3, highly efficient ddr3/1333 interface, directory-based coherence). - AMD satisfies itself with bottom-feeding (which probably also means only low-end consumer stuff with little HPC interest). I've had good reason to be an AMD fan in recent-ish years, but if Intel is firing on all cylinders, AMD needs to be the rotary engine or have more cylinders, or something... ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Nehalem Xeons
In message from Håkon Bugge [EMAIL PROTECTED] (Tue, 14 Oct 2008 07:50:32 +0200): They are _definitively_ worth waiting for, although I am not familiar with the release timing. But I have been running on a dual-socket with 8 cores and 16 SMTs. And I say they are worth waiting for. Q1'2009 - unfortunately, I don't know more exactly :-( Mikhail Kuzminsky Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry Moscow Håkon At 01:57 14.10.2008, Ivan Oleynik wrote: I am still in process of purchasing a new cluster and consider whether is worth waiting for new Intel Xeons. Does someone know when Intel will start selling Xeons based on Nehalem architecture? They announced desktop launch (Nov 16) but were quiet about server release. Thanks, Ivan ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Håkon Bugge Chief Technologist mob. +47 92 48 45 14 off. +47 21 37 93 19 fax. +47 22 23 36 66 [EMAIL PROTECTED] Skype: hakon_bugge Platform Computing, Inc. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Re: Beowulf Digest, Vol 55, Issue 2
In message from Li, Bo [EMAIL PROTECTED] (Thu, 4 Sep 2008 14:34:00 +0800): Hello, Is it too expensive for the platform? The easy solution is: And X48 level motherboard with CF support, about $150 Q6600 Processor, about $170 Two 4870X2 $1,100 Do somebody know, are ACML routines parallelized for using of few GPGPUs ? Mikhail Two Seagate SATA Harddisk 500G for Raid1, about $140 4*2G DDR2 RAM, about $150 PSU 1000W, about $200 A big box, about $100 That's all, in total, $2,010. Regards, Li, Bo - Original Message - From: Maurice Hilarius To: beowulf@beowulf.org Cc: [EMAIL PROTECTED] ; [EMAIL PROTECTED] ; [EMAIL PROTECTED] Sent: Thursday, September 04, 2008 6:51 AM Subject: Re: Beowulf Digest, Vol 55, Issue 2 Li, bo wrote: .. From: Li, Bo [EMAIL PROTECTED] Subject: Re: [Beowulf] gpgpu Hello, It seemed that you had got a very good example for GPGPU. As I said before, it's not the time for GPGPU to do the DP calculation at the moment. If you can bear SP computation, you will find more about it. NVidia just sent me some special offer about their Tesla platforms, which said that the workstation equipped with two GTX280 level professional cards costs about $5000, not bad. But my intention is still to lower the core frequency of a gaming card, and use it for computation. Regards, Li, Bo Looking at AMD/ATI Firestream and 4850 pricing, it is not too bad: AMD FIRESTREAM 9250 STREAM PROCESSOR (P/N: 100-505563)$880 VISIONTEK RADEON HD4870X2 2GB PCI-E (P/N: 900250) $575 VISIONTEK RADEON HD 4870 512MB PCI-E (P/N: 900244) $355 The 4870 and X2 also run the AMD code. So, given a decent machine, with 4 cores and a pair of the 4870X2, one can achieve some pretty amazing GPU performance levels for a system well under $4,000. With dualX2s ( 4 GPU engines) around $4700 ( extra PSU capacity and cooling is needed for that level). I hear that AMD have a new Firestream coming, with the 48x0 family chips on it, but that will likely be a bit on the pricier side.. Anyway, the Firestream has GPUs with Double-Precision Floating Point. Something the nVidia offerings do not. Worth considering. http://ati.amd.com/technology/streamcomputing/product_firestream_9250.html SDK: http://ati.amd.com/technology/streamcomputing/sdkdwnld.html -- With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd.FAX: 01-780-456-9772 11060 - 166 Avenue email:[EMAIL PROTECTED] Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] gpgpu
In message from Li, Bo [EMAIL PROTECTED] (Thu, 28 Aug 2008 14:20:15 +0800): ... Currently, the DP performance of GPUs is not as good as we expected, only about 1/8 - 1/10 of the SP Flops. It is also a problem.
AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for DP. It's 5 times slower than SP. The Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 of that is about 200 GFLOPS DP. The price will be, I suppose, about $2000 - as for the 9170. Let me look at a modern dual socket quad-core beowulf node w/a price of about $4000+, for example. For the Opteron 2350/2 Ghz chips (which I use) peak DP performance is 64 GFLOPS (8 cores). For 3 Ghz Xeon chips - about 100 GFLOPS. Therefore GPGPU peak DP performance is 1.5-2 times higher than w/CPUs. Is it enough for an essential calculation speedup - taking into account the time for data transmission to/from the GPU? (See the rough transfer-time estimate after this message.)
I would suggest hybrid computation platforms, with GPU, CPU, and processors like Clearspeed. It may be a good topic for a programming model.
Clearspeed, if there is no new hardware now, does not have enough DP performance in comparison w/typical modern servers on quad-core CPUs.
Yours Mikhail
Regards, Li, Bo
- Original Message - From: Vincent Diepeveen [EMAIL PROTECTED] To: Li, Bo [EMAIL PROTECTED] Cc: Mikhail Kuzminsky [EMAIL PROTECTED]; Beowulf beowulf@beowulf.org Sent: Thursday, August 28, 2008 12:22 AM Subject: Re: [Beowulf] gpgpu
Hi Bo, Thanks for your message. What library do i call to find primes? Currently it's searching here for primes (PRPs) of the form p = (2^n + 1) / 3; n is here about 1.5 million bits roughly as we speak. For SSE2 type processors there is the George Woltman assembler code (MiT) to do the squaring + implicit modulo; how do you plan to beat that type of really optimized number crunching on a GPU? You'll have to figure out a way to find an instruction level parallelism of at least 32, which also doesn't write to the same cacheline, i *guess* (no documentation to verify that in fact). So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes. In fact the first problem to solve is to do some sort of squaring real quickly. If you figured that out at a PC, experience teaches you're still losing a potential factor of 8, thanks to another zillion optimizations. You're not allowed to lose factor 8. That 52 gflop a gpu can deliver on paper @ 250 watt TDP (you bet it will consume that when you let it work so hard) means the GPU delivers effectively less than 7 gflops double precision thanks to inefficient code. Additionally remember the P4. On paper the claim in integers when it released was that it would be able to execute 4 integers a cycle; the reality is that it was a processor getting an IPC far under 1 for most integer codes. All kinds of stuff sucked at it. Experience teaches this is the same for todays GPUs; the scientists who have run codes on them so far and are really experienced CUDA programmers figured out the speed it delivers is a very big bummer. Additionally 250 watt TDP for massive number crunching is too much. It's well over factor 2 the power consumption of a quadcore. Now i can take a look soon in China myself what power prices are over there, but i can assure you they will rise soon. Now that's a lot less than a quadcore delivers with a tdp far under 100 watt. Now i explicitly mention the n's i'm searching here, as it should fit within caches. So the very secret bandwidth you can practically achieve (as we know nvidia lobotomized bandwidth in the GPU cards, only the Tesla type seems to be not lobotomized), i'm not even teasing you with that.
This is true for any type of code. You're losing it to the details. Only custom tailored solutions will work, simply because they're factors faster. Thanks, Vincent
On Aug 27, 2008, at 2:50 AM, Li, Bo wrote: Hello, IMHO, it is better to call BLAS or a similar library rather than programming your own functions. And CUDA treats the GPU as a cluster, so .CU is not working like our normal codes. If you have got too many matrix or vector computations, it is better to use Brook+/CAL, which can show the great power of the AMD gpu. Regards, Li, Bo
- Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: Vincent Diepeveen [EMAIL PROTECTED] Cc: Beowulf beowulf@beowulf.org Sent: Wednesday, August 27, 2008 2:35 AM Subject: Re: [Beowulf] gpgpu
In message from Vincent Diepeveen [EMAIL PROTECTED] (Tue, 26 Aug 2008 00:30:30 +0200): Hi Mikhail, I'd say they're ok for black box 32 bits calculations that can do with a GB or 2 RAM, other than that they're just luxurious electric heating.
I also want to have a simple blackbox, but 64-bit (Tesla C1060 or Firestream 9170 or 9250). Unfortunately life isn't restricted to BLAS/LAPACK/FFT :-) So I'll need to program something else. People say that the best choice is CUDA for Nvidia. When I look at the sgemm source, it has about 1 thousand (or more) lines
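The data-transmission question raised above can be made concrete with a one-line estimate. Treating the PCIe copy and the kernel as sequential steps, offloading a routine that performs F floating-point operations on B bytes of data only pays off when
  B / G_pcie + F / R_gpu  <  F / R_cpu
where G_pcie is the effective host-to-device copy bandwidth and R_gpu, R_cpu are sustained (not peak) rates. With a few GB/s of effective PCIe bandwidth, kernels that do O(1) flops per transferred byte - like the two-vector addition example mentioned in the next message - cannot win this way; it is the dense operations with high reuse per byte, such as sgemm, that can. This is only a back-of-the-envelope model; it ignores overlapping the copy with compute.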
Re: [Beowulf] gpgpu
In message from Vincent Diepeveen [EMAIL PROTECTED] (Tue, 26 Aug 2008 00:30:30 +0200): Hi Mikhail, I'd say they're ok for black box 32 bits calculations that can do with a GB or 2 RAM, other than that they're just luxurious electric heating.
I also want to have a simple blackbox, but 64-bit (Tesla C1060 or Firestream 9170 or 9250). Unfortunately life isn't restricted to BLAS/LAPACK/FFT :-) So I'll need to program something else. People say that the best choice is CUDA for Nvidia. When I look at the sgemm source, it has about a thousand (or more) lines in *.cu files. Therefore I think that a somewhat more difficult algorithm, such as some special matrix diagonalization, will require a lot of programming work :-(. It's interesting that when I read the Firestream Brook+ kernel function source example - for the addition of 2 vectors (Building a High Level Language Compiler For GPGPU, Bixia Zheng ([EMAIL PROTECTED]) Derek Gladding ([EMAIL PROTECTED]) Micah Villmow ([EMAIL PROTECTED]) June 8th, 2008) - it looks SIMPLE. Maybe there are a lot of details/source lines which were omitted from this example?
Vincent p.s. if you ask me, honestly, 250 watt or so for the latest gpu is really too much.
250 W is TDP; the average value declared is about 160 W. I don't remember which GPU - from AMD or Nvidia - has a lot of special functional units for sin/cos/exp/etc. If they are not used, maybe the power will be a bit lower. As for the Firestream 9250, AMD says about 150 W (although I'm not absolutely sure that it's TDP) - that's the same as for some Intel Xeon quad-core chips w/names beginning with X.
Mikhail
On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote: BTW, why are GPGPUs considered to be vector systems? Taking into account that GPGPUs contain many (equal) execution units, I think it might be not the SIMD, but the SPMD model. Or does it depend on the software tools used (CUDA etc)?
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] hang-up of HPC Challenge
In message from Greg Lindahl [EMAIL PROTECTED] (Tue, 19 Aug 2008 19:39:38 -0700): On Wed, Aug 20, 2008 at 03:45:43AM +0400, Mikhail Kuzminsky wrote: To localize the possible reason for the problem, I ran the pure HPL test instead of HPCC. HPL writes its output directly to the screen instead of to a file. Using MPICH w/np=8 I obtained a normal HPL result for N=35000 - including the 3 PASSED lines for the ||Ax-b|| checks. BUT! Linux hangs up immediately after these lines are printed.
Well, what did your configuration file tell HPL to do? Does it have another test, perhaps a bigger one, or is it supposed to exit? We aren't mind-readers.
Please excuse me: I have now run 2 HPL cases for the same N=1: (1st) a single HPL run, i.e. ONE N=1, ONE blocksize value, and ONE value of every other HPL.dat parameter; (2nd) a multiple HPL run w/the same (one) N=1 and blocksize=100, but with a set of PFACTs etc (see the output below). The 1st run finished successfully, the 2nd led to a Linux hang-up. Yours Mikhail
single HPL run:
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
An explanation of the input/output parameters follows:
T/V: Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 1
NB : 100
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N)
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
T/V          N      NB   P   Q    Time      Gflops
WR11C2R4     1      100  2   4    23.32     2.859e+01
||Ax-b||_oo / ( eps * ||A||_1 * N) = 0.0767386 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0181586 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0040588 .. PASSED
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
End of Tests.
[1]+ Done    mpirun -np 8 xhpl
multiple HPL run:
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
An explanation of the input/output parameters follows:
T/V: Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 1
NB : 100
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed: 1) ||Ax-b||_oo / ( eps * ||A||_1 * N) 2) ||Ax-b||_oo / ( eps
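For completeness: the difference between the two runs above lives entirely in HPL.dat. The single-case file asks for exactly one value of every parameter, while the sweep file lists several variants and HPL then runs every combination in one process. A sketch of the relevant lines (the line labels follow the stock HPL.dat shipped with the benchmark; the entries are chosen to match the parameter listings printed above, and the rest of the 31-line file is elided):
Single case (excerpt):
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
Sweep (excerpt) - the kind of file that produced the hang:
3            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
3            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)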
Re: [Beowulf] hang-up of HPC Challenge
In message from Chris Samuel [EMAIL PROTECTED] (Wed, 20 Aug 2008 11:12:52 +1000 (EST)): - Mikhail Kuzminsky [EMAIL PROTECTED] wrote: What else may be the reason for the hangup?
Depends what you mean by hangup really.. Does the code crash, does it just stop idle, does it busy loop, does the node oops, does it lockup, etc?
I believe that a program crash is not a hangup. When I wrote about a Linux hangup, I meant that Linux doesn't respond to anything - not to the keyboard, not to ssh client requests, etc.
If you're not already running a mainline kernel (say 2.6.26.2) it might also be worth giving that a go too, we're happily doing it on our Barcelonas (though on CentOS not SuSE).
I use the 2.6.22.5-31 kernel from the SuSE 10.3 distribution.
Mikhail
cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] new flash SSDs
FYI: Intel presented at IDF new SATA 2.5" SSDs (based on NAND flash) for servers. These SSDs (X25-E Extreme, 32 GB) support command queueing (32 operations), R/W throughput of 250/170 MB/s, and 75 usec read latency; 35000 reads per second and 3300 writes per second for 4 KB blocks. A 64 GB SSD is expected in Q1'2009. I hope this will lead to a decrease in SSD market prices. Unfortunately I have no information about prices or about lifetime. But I'm not too enthusiastic about the prices: even a Samsung PATA 2.5"/32 GB SSD costs about $300, and the IBM SATA ones are much more expensive. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] hang-up of HPC Challenge
To localize the possible reason for the problem, I ran the pure HPL test instead of HPCC. HPL writes its output directly to the screen instead of to a file. Using MPICH w/np=8 I obtained a normal HPL result for N=35000 - including the 3 PASSED lines for the ||Ax-b|| checks. BUT! Linux hangs up immediately after these lines are printed.
Mikhail
In message from Mikhail Kuzminsky [EMAIL PROTECTED] (Mon, 18 Aug 2008 22:20:16 +0400): I ran a set of HPC Challenge benchmarks on ONE dual socket quad-core Opteron 2350 (Rev. B3) based server (8 logical CPUs). RAM size is 16 Gbytes. The tests were performed under SuSE 10.3/x86-64, for LAM MPI 7.1.4 and MPICH 1.2.7 from the SuSE distribution, using Atlas 3.9. Unfortunately there is only one such cluster node, and I can't reproduce the run on another node :-( For N (matrix size) up to 1 all looks OK. But for larger N (15000/2/...) hpcc execution (mpirun -np 8 hpcc) leads to a Linux hang-up. In the top output I see 8 hpcc instances, each eating about 100% of a CPU, reasonable amounts of virtual and RSS memory per hpcc process, and no swap usage. Usually there are no PTRANS results in the hpccoutf.txt results file, but in a few cases (when I actively watched the hpcc execution by issuing ps/top) I see reasonable PTRANS results but no HPLinpack results. One time I obtained PTRANS, HPL and DGEMM results for N=2, but a hangup later - on the STREAM tests. Maybe that is simply because the output buffer is not finally written to the output file on the HDD at the hangup. One possible reason for the hang-ups is a memory hardware problem, but what about possible software reasons for hangups? The hpcc executable is 64-bit, dynamically linked. /etc/security/limits.conf is empty. The stacksize limit (for the user issuing mpirun) is unlimited, the main memory limit is about 14 GB, the virtual memory limit about 30 GB. Atlas was compiled for 32-bit integers, but that's enough for such N values. Even /proc/sys/kernel/shmmax is 2^63-1. What else may be the reason for the hangup?
Mikhail Kuzminskiy Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Building new cluster - estimate
In message from Gerry Creager [EMAIL PROTECTED] (Wed, 06 Aug 2008 09:59:59 -0500): Robert Kubrick wrote: Or use solid-state data disks? Does anybody here have experience with SSD disks in HPC? Not on OUR budget! ;-)
It was a proposal for the journal part only ;-) SSD/flash disks try, when it's physically possible, not to really erase (rewrite) the same cells - to increase their lifetime. But if I use practically the whole HDD partition for scratch files (and therefore the whole SSD), IMHO it will be impossible not to erase the flash RAM. What will happen to the SSD disk's lifetime in that case?
Mikhail Kuzminsky Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Building new cluster - estimate
In message from Joshua Baker-LePain [EMAIL PROTECTED] (Tue, 5 Aug 2008 14:10:33 -0400 (EDT)): On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote: xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't know what will be after xfsdump/xfsrestore). But which modern linux
Not true -- see xfs_fsr(8).
Thanks!! I haven't looked at xfs details for many years :-( - it's my mistake.
Back in the IRIX days, it was recommended to run this regularly.
I don't remember xfs_fsr being included in the IRIX 6.1-6.4 we used.
Mikhail
However, ISTR that the current recommendation is as needed, but it really shouldn't be needed.
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
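For anyone else who had missed the tool: xfs_fsr reorganizes the extents of regular files on a mounted XFS filesystem, and xfs_db can report how fragmented a filesystem actually is before you bother. The mount point and device name below are only examples:
$ xfs_db -c frag -r /dev/sdb1    # read-only report of the fragmentation factor
$ xfs_fsr -v /scratch            # defragment every file on the /scratch filesystem, verbosely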
Re: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In message from Mark Hahn [EMAIL PROTECTED] (Fri, 1 Aug 2008 10:06:17 -0400 (EDT)): ... Plus, with a lot of those PDUs you can add thermal sensors and trigger power off on high temperature conditions.
IPMI normally provides all the motherboard's sensors as well. it seems like those are far more relevant than the temp of the PDU... using lm_sensors is a poor substitute for IPMI.
IMHO the only disadvantage of lm_sensors is the problem of building the right sensors.conf file.
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
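For what it's worth, the sensors.conf work usually comes down to one short stanza per monitoring chip: name the chip the kernel driver exposes, label the inputs that are actually wired on the board, set sane limits, and ignore the rest. A minimal illustrative sketch - the chip names, labels and the limit value below are assumptions for a generic dual-Opteron board, not taken from any real config:
chip "k8temp-pci-*"
    label temp1 "CPU0 core"
    label temp3 "CPU1 core"
chip "w83627hf-*"
    label in0 "VCore"
    set temp2_max 60
    ignore in5
The chip/label/set/ignore keywords are the standard lm_sensors configuration statements; the hard part in practice is mapping the inputs to what the board vendor actually connected, which is exactly the problem mentioned above.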
[Beowulf] MPI: over OFED and over IBGD
Is there some MPI implementation/version which may be installed on some nodes - to work over the Mellanox IBGD 1.8.0 (Gold Distribution) IB stack - and on other nodes for work w/OFED-1.2? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] MPI: over OFED and over IBGD
In message from Gilad Shainer [EMAIL PROTECTED] (Thu, 3 Jul 2008 09:41:01 -0700): Mikhail Kuzminsky wrote: Is there some MPI implementation/version which may be installed on some nodes - to work over the Mellanox IBGD 1.8.0 (Gold Distribution) IB stack - and on other nodes for work w/OFED-1.2?
IBGD is out of date, and AFAIK none of the latest versions of the various MPIs were tested against it.
It's clear, but I didn't ask about the *LATEST* MPI versions ;-)
I would recommend updating the install to OFED from IBGD, and if you need some help let me know.
Thank you very much for your help!
If you must keep it
Yes. There is a Russian romance w/the words: "You can't understand, you can't understand, you can't understand my sorrows" :-))
, then MVAPICH 0.9.6 might work.
Eh, I used 0.9.5 and 0.9.9 :-) Now I will look at the mvapich archives. Thanks!
Mikhail
Gilad. ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Bernd Schubert [EMAIL PROTECTED] (Sat, 28 Jun 2008 19:04:50 +0200): On Saturday 28 June 2008, Li, Bo wrote: Hello, Sorry, I don't have the same applications as you. Did you compile them with gcc? If gcc, then -O3 can do some optimization. -march=k8 is enough, I think.
As Mikhail wrote in his first mail, he uses binaries from Gaussian Inc. Can gfortran in the meantime compile gaussian? Even if it can, it might be a problem for publications, since the only officially supported compiler is pgf77. Mikhail, do you have the source at all? Due to the different cache model of the Barcelona a recompilation might really help.
No, I have no source :-( I absolutely agree w/you - the DFT used is cache-friendly. Moreover, this big performance gap corresponds to DFT w/FMM (the Fast Multipole Method). For usual DFT, the Opteron 2350 cores are also slower than Opteron 246, but only by 33%.
And you make sure the CPU is running at the default frequency. Sometimes
Yeah, can you check the scaling governor isn't set to ondemand or conservative?
Yes, I looked at the frequency many times (like crazy :-)). There is no powersaved daemon, and I saw only 2 Ghz in /proc/cpuinfo :-)
Yours Mikhail
Cheers, Bernd ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Strange Opteron 2350 performance: Gaussian-03
I'm running a set of quad-core Opteron 2350 benchmarks, in particular using Gaussian-03 (the binary version from Gaussian, Inc., i.e. translated by an older - than current - pgf77 version, for the Opteron target). I compare in particular *one core* of Opteron 2350 w/an Opteron 246 having the same 2 Ghz frequency and the same amount of cache per core (512K L2 + 0.25*2 MB L3 for Opteron 2350 vs just 1 MB L2 for Opteron 246). The Opteron 246 even has faster DDR2-667 RAM. The Gaussian-03 performance in some cases is close for both Opterons (remember that the compilation didn't know about Barcelona!), but for the very popular DFT method the Opteron 2350 cores look slow: one job gives 33% worse performance than the Opteron 246. But on the standard Gaussian-03 test397.com DFT/B3LYP test: *one* (1) Opteron 2350 core takes 15667 sec. (both start-to-stop and cpu) vs 8709 sec. on (one) Opteron 246!! There is no powersaved daemon, so the frequency of the Opteron 2350 is fixed at 2 Ghz. I reproduced this result twice on the Opteron 2350, in particular once using forced good numactl placement. I'm reproducing it on the Opteron 246 again :-) but I have indirect confirmation of these timings (based on a 2-cpu Opteron 246 parallel test). Yes, AFAIK the DFT method is cache-friendly, and the slower L3 cache in Opteron 2350 may give worse performance. But 1.8 times worse?? Any comments are welcome. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
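One detail worth spelling out, since an unpinned single-core job on a NUMA box can quietly end up with all its memory on the remote socket: forcing local placement with numactl looks like the lines below. The g03 invocation is only illustrative; the numactl options themselves are the standard ones:
$ numactl --hardware                   # see which CPUs and how much memory belong to node 0
$ numactl --cpunodebind=0 --membind=0 g03 < test397.com > test397.log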
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Li, Bo [EMAIL PROTECTED] (Sun, 29 Jun 2008 00:07:07 +0800): Hello, I am afraid there must be something wrong with your experiment. How did you get the performance? Were your DFT codes running in parallel? Any optimization involved?
I was afraid of the same, but the results have been reproduced twice. As I wrote in my message: - these were ONE-CORE (one CPU for the Opteron 246) runs - the optimization was performed for the OLD Opteron 246 (because Gaussian, Inc. does not offer binaries optimized specially for Barcelona). DFT test397 (like any other DFT job) parallelizes well, and on Opteron 246 it gives a 1.9 times speedup on 2 CPUs. But I didn't run a 2-core parallel job on the Opteron 2350: I was stressed by the results obtained for 1 core.
In most of my tests, K8L or K10 can beat the old opteron at the same frequency with about a 20% improvement.
Sorry, do you have this for Gaussian-03 and for DFT in particular? Did you compile it on K10 using target=barcelona (i.e. optimized for Barcelona)?
Yours Mikhail
Regards, Li, Bo
- Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: beowulf@beowulf.org Sent: Saturday, June 28, 2008 11:48 PM Subject: [Beowulf] Strange Opteron 2350 performance: Gaussian-03 [...]
___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Li, Bo [EMAIL PROTECTED] (Sun, 29 Jun 2008 00:37:12 +0800): The problem is present only in the Gaussian-03 binary version we have. If I compile Linpack myself, for example, the Opteron 2350 core is faster. Yes - of course it's Linux x86-64, SuSE 10.3. The powersave daemon is not running. Mikhail Hello, Sorry, I don't have the same applications as you. Did you compile them with gcc? If gcc, then -O3 can do some optimization; -march=k8 is enough, I think. And make sure the CPU is running at the default frequency - sometimes PowerNow! is active by default. And BTW, what's your platform? Linux? Which release? x86-64? Regards, Li, Bo - Original Message - From: Mikhail Kuzminsky [EMAIL PROTECTED] To: Li, Bo [EMAIL PROTECTED] Cc: beowulf@beowulf.org Sent: Sunday, June 29, 2008 12:23 AM Subject: Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Strange Opteron 2350 performance: Gaussian-03
In message from Joe Landman [EMAIL PROTECTED] (Sat, 28 Jun 2008 14:48:02 -0400): This is possible, depending upon the compiler used. Though I have to admit that I find it odd that it would be the case within the Opteron family and not between Opteron and Xeon. Intel compilers used to (haven't checked 10.1) switch between fast (SSE*) and slow (x87 FP) paths as a function of a processor version string. If this is code built with an old Intel compiler, it is possible that the code paths are different, though as noted, I would find that surprising within the Opteron family. Well, I thought about the (absence of) SSE use in the binary Gaussian-03 Rev. C02 version I used, but even if x87 code was really generated by pgf77 - why does this x87-based code give such high performance on the Opteron 246 in comparison with an Opteron 2350 core? On both CPUs I ran the same Gaussian binaries! Modern PGI compilers (the suggested default for Gaussian-03, last I checked) have the ability to do this as well, though I don't know how they implement it (capability testing, hopefully?). Out of curiosity, how does STREAM run on both systems? I ran STREAM on Opteron 242 and 244 a few years ago. The scalability and the throughput itself were OK. Recently I ran STREAM on my Opteron 2350-based dual-socket server. In line with the faster DDR2-667 I obtained higher throughput. In particular I reproduced the 8-core result presented in McCalpin's table (sent from AMD), and some data presented earlier on our Beowulf mailing list. (BTW, there is one bad thing for STREAM on this server - the corresponding data are absent from McCalpin's table: the throughput scales well from 1 to 2 OpenMP threads and gives a good result for 8 threads, but the throughput for 4 threads is about the same as for 2 threads. The reason, IMHO, is that for 8 threads RAM is allocated by the kernel on both nodes, but for 4 threads the allocated RAM is placed on one node, and the 4 threads compete badly for memory access.) Taking into account that Gaussian-03 was slow on an Opteron 2350 core in a sequential run, the Opteron 2350's RAM can only be an advantage compared with the Opteron 246. I didn't run STREAM on the Opteron 246, but that is clear to me. Also, it is possible, with a larger cache, that you might be running into some odd cache effects (tlb/page thrashing). But DFTs are usually small and thus sensitive to cache size. You might be able to instrument the run within a papi wrapper, and see if you observe a large number of cache/tlb flushes for some reason. On a related note: are you using a stepping before B3 of the 2350? That could impact performance, if you have the patch in place or have the tlb/cache turned off in the BIOS (some motherboard makers created a patch to do this). Gaussian-03 fails in link302 on Barcelona B2 because of this error. I use stepping B3. Yours Mikhail Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: [EMAIL PROTECTED] web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
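If the 4-thread dip really comes from all pages landing on one node, forcing page interleaving across both nodes should recover most of the bandwidth. A sketch, assuming an OpenMP build of STREAM called ./stream (the binary name is a placeholder):

  # Default first-touch placement
  export OMP_NUM_THREADS=4
  ./stream
  # Same run with pages spread round-robin over both memory controllers
  numactl --interleave=all ./stream
  # Node sizes and free memory, for reference
  numactl --hardware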
Re: [Beowulf] Again about NUMA (numactl and taskset)
In message from Håkon Bugge [EMAIL PROTECTED] (Thu, 26 Jun 2008 11:16:17 +0200): Numastat statistics before the Gaussian-03 run (OpenMP, 8 threads, 8 cores; it requires 512 MB of shared memory plus a bit more, and could fit in the memory of either node - I have 8 GB per node, a bit under 6 GB free in node0 and a bit over 7 GB free in node1):
node0: numa_hit 14594588 numa_miss 0 numa_foreign 0 interleave_hit 14587 local_node 14470168 other_node 124420
node1: numa_hit 11743071 numa_miss 0 numa_foreign 0 interleave_hit 14584 local_node 11727424 other_node 15647
--- Statistics after the run:
node0: numa_hit 15466972 numa_miss 0 numa_foreign 0 interleave_hit 14587 local_node 15342552 other_node 124420
node1: numa_hit 12960452 numa_miss 0 numa_foreign 0 interleave_hit 14584 local_node 12944805 other_node 15647
--- Unfortunately I don't know what exactly these lines mean!! :-( (BTW, does somebody know?!) But intuitively it looks (taking into account the increase of the numa_hit and local_node values) as if the RAM allocation was performed from BOTH nodes (and more RAM was allocated from node1 memory - node1 initially had more free RAM). This is contrary to my expectation of contiguous RAM allocation from the RAM of one node! Mikhail Kuzminsky, Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow At 18:34 25.06.2008, Mikhail Kuzminsky wrote: [the question is quoted in full elsewhere in this digest] Guess the answer is, it depends. The memory will be allocated on the node where the thread first touching it is running. But you could use numastat to investigate the issue. Håkon ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
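For what it's worth, the usual reading of those counters is: numa_hit - pages allocated on the node the kernel intended; numa_miss - pages allocated on this node although another node was preferred; numa_foreign - allocations intended for this node that were satisfied elsewhere; interleave_hit - interleaved allocations that landed on the intended node; local_node / other_node - whether the process requesting the allocation was running on this node or on another one. A small sketch of how the placement can be watched while such a job runs (the job name is a placeholder):

  # Refresh the per-node counters every 2 seconds during the run
  watch -n 2 numastat
  # Or diff snapshots taken before and after the job
  numastat > before.txt ; ./g03_job ; numastat > after.txt
  diff before.txt after.txt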
Re: [Beowulf] Again about NUMA (numactl and taskset)
Let me now assume the following situation. I have an OpenMP-parallelized application whose number of threads equals the number of CPU cores per server. And let me assume that this application does not use too much virtual memory, so all the real memory used can fit in the RAM of *one* node. This is not an abstract question - a lot of the Gaussian-03 jobs we have fit this situation, and all 8 cores of a dual-socket quad-core Opteron server will be well loaded. Is it right that all the application memory (without numactl) will be allocated (by the Linux kernel) on *one* node? Then only one memory controller will be used. OK, then if I have the same server but with half the memory (still enough to run this Gaussian-03 job!) and the DIMMs populate both nodes, the performance of this server will be higher! - because both memory controllers (and therefore more memory channels) will work simultaneously. Is it right that the cheaper server will have higher performance in cases like this?? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
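The two placements being compared can be forced explicitly with numactl, which makes the measurement straightforward. A sketch, with the binary name as a placeholder:

  # Threads and pages confined to node 0: one memory controller only
  numactl --cpunodebind=0 --membind=0 ./g03_openmp_job
  # Pages spread round-robin over both nodes: both controllers share the load
  numactl --interleave=0,1 ./g03_openmp_job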
[Beowulf] Timers and TSC behaviour on SMP/x86
As I remember, TSCs on SMP/x86 are synchronized by the Linux kernel during boot. But the only message (about the TSC) I see after Linux boots, in dmesg (or /var/log/messages), with SuSE 10.3 and its default 2.6.22 kernel on a quad-core dual-socket Opteron server, is: Marking TSC unstable due to TSCs unsynchronized. Does it mean that an RDTSC-based timer (I use it for microbenchmarks) will give wrong results? :-( Some additional information, according to the Software Optimization Guide for AMD Family 10h Processors (quad-core) from Apr 4th, 2008: previously each AMD core had its own TSC. Now quad-core processors have one common clock source in the northbridge (BTW, is the northbridge in this case the one integrated into the CPU chip - i.e. including the integrated memory controller and HT link support? - M.K.) for all the TSCs of the CPUs (cores? - M.K.). The synchronization accuracy should be a few tens of cycles. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
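When the kernel prints that message it stops using the TSC as the system clocksource (gettimeofday then comes from HPET or the ACPI PM timer), but raw RDTSC from user space still reads the per-core counters, which may drift apart. A quick check of what the kernel decided - sysfs paths may differ between kernel versions - and pinning the microbenchmark to a single core sidesteps cross-core TSC skew:

  dmesg | grep -i tsc
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  # read the TSC only from one core during the measurement
  taskset -c 0 ./my_microbenchmark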
[Beowulf] Again about NUMA (numactl and taskset)
I'm testing my first dual-socket quad-core Opteron 2350-based server. Let me assume that the RAM used by the kernel and system processes is zero, there is no physical RAM fragmentation, and the affinity of processes to CPU cores is maintained. I assume also that both nodes are populated with an equal number of identical DIMMs. If I run a thread-parallelized (for example, OpenMP) application with 8 threads (8 = number of server CPU cores), the ideal case for all the (equal) threads is: the shared memory used by each of the 2 CPUs (by each of the 2 quads of processes) should be divided equally between the 2 nodes, and the local memory used by each process should be mapped analogously. Theoretically such an ideal case might be realized if my application (8 threads) uses practically all the RAM and uses only shared memory (I assume here also that all the RAM addresses have the same load, and that the size of the program code is zero :-) ). The questions are:
1) Is there some way to distribute the local memory of the threads analogously (I assume it has the same size for each thread) using a reasonable NUMA allocation?
2) Is it right that using numactl for applications may give a performance improvement in the following case: the number of application processes equals the number of cores of one CPU *AND* the RAM amount necessary for the application fits in one node's DIMMs (I assume that RAM is allocated contiguously)? What will happen to performance (when using numactl) in the case where the required RAM size is higher than the RAM available on one node, so the program cannot take advantage of (load-balanced) simultaneous use of the memory controllers on both CPUs? (I again assume that RAM is allocated contiguously.)
3) Is there some reason to use things like mpirun -np N /usr/bin/numactl numactl_parameters my_application ?
4) If I use malloc() and don't use numactl, how can I find out from which node Linux will begin the real memory allocation? (Remember that I assume all the RAM is free.) And how can I find out where the DIMMs corresponding to the higher or lower RAM addresses are placed?
5) In which cases is it reasonable to switch on node memory interleaving (in the BIOS) for an application which uses more memory than is present on one node?
And BTW: if I use taskset -c CPU1,CPU2, ... program_file and the program_file creates some new processes, will all these processes run only on the CPUs defined in the taskset command? Mikhail Kuzminsky Computer Assistance to Chemical Research Center, Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
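For questions 2), 3) and the taskset question, the command forms being discussed look roughly like this (binary names and node numbers are illustrative):

  # (2) Keep a job that fits one node on that node's cores and DIMMs
  numactl --cpunodebind=0 --membind=0 ./my_application
  # (3) Give every MPI rank the same NUMA policy; per-rank node selection
  #     would need a small wrapper script around numactl
  mpirun -np 8 numactl --localalloc ./my_application
  # taskset: child processes inherit the affinity mask of the parent
  taskset -c 0,1,2,3 ./my_application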
Re: [Beowulf] Again about NUMA (numactl and taskset)
In message from Vincent Diepeveen [EMAIL PROTECTED] (Mon, 23 Jun 2008 18:41:21 +0200): I would add to this: how sure are we that a process (or thread) that allocated, initialized and writes to memory on a single specific memory node also keeps getting scheduled on a core of that memory node? It seems to me that sometimes (like every second or so) threads jump from one memory node to another. I could be wrong, but I certainly have that impression with the Linux kernels. Dear Vincent, do I understand you correctly that simply using taskset is not enough to prevent process migration to another core/node?? Mikhail That said, it has improved a lot; now all we need is a better compiler for Linux. For my chess program GCC generates an executable that is 22% slower in positions per second than Visual C++ 2005. Thanks, Vincent On Jun 23, 2008, at 4:01 PM, Mikhail Kuzminsky wrote: [the original question is quoted in full elsewhere in this digest] ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
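taskset only constrains the set of CPUs a process (and its children, which inherit the mask) may run on; within that set the scheduler is still free to move threads around, so restricting each process to a single core is the strict way to rule out migration. A sketch for checking and tightening the mask of a running job (the PID is hypothetical):

  taskset -pc 12345        # show the allowed-CPU list of PID 12345
  taskset -pc 2 12345      # pin PID 12345 to core 2 only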
Re: [Beowulf] SuperMicro and lm_sensors
In message from Bernard Li [EMAIL PROTECTED] (Thu, 19 Jun 2008 11:28:08 -0700): Hi David: On Thu, Jun 19, 2008 at 6:50 AM, Lombard, David N [EMAIL PROTECTED] wrote: Did you look for /proc/acpi/thermal_zone/*/temperature ? The glob is for your BIOS-defined ID. If it does exist, that's the value that drives /proc/acpi/thermal_zone/*/trip_points. See also /proc/acpi/thermal_zone/*/polling_frequency. I have always wondered about /proc/acpi/thermal_zone. I noticed that on some servers the files exist, but on others that directory is empty. I guess this depends on whether the BIOS exposes the information to the kernel? Or are there modules that I need to install to get it working? AFAIK it depends on the BIOS. On my Tyan S2932 with the latest BIOS version this directory is empty. Mikhail Thanks, Bernard ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
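A quick way to see which of the two sources a given node actually provides - ACPI thermal zones exported by the BIOS, or hardware-monitoring chips read via lm_sensors:

  ls /proc/acpi/thermal_zone/
  cat /proc/acpi/thermal_zone/*/temperature 2>/dev/null
  # fallback: monitoring chips, if lm_sensors is configured for the board
  sensors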
[Beowulf] Tyan S2932 and lm_sensors
Sorry, does somebody have a correct sensors.conf file for the Tyan S2932 motherboard? There is no lm_sensors configuration file for this mobo on the Tyan site :-( Yours Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Tyan S2932 and lm_sensors
In message from Seth Bardash [EMAIL PROTECTED] (Wed, 18 Jun 2008 10:32:17 -0600): ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf Seth Bardash Integrated Solutions and Systems 1510 Old North Gate Road Colorado Springs, CO 80921 719-495-5866 719-495-5870 Fax 719-337-4779 Cell http://www.integratedsolutions.org Failure can not cope with knowledge and perseverance! ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Tyan S2932 and lm_sensors
In message from Seth Bardash [EMAIL PROTECTED] (Wed, 18 Jun 2008 10:32:17 -0600): ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf Seth Bardash Thank you very much!! It's strange, but I didn't find this file in the Tyan archive! Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
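For completeness, the usual way to put a downloaded board description to work with the lm_sensors 2.x layout used on SuSE 10.3 (newer lm_sensors 3.x reads /etc/sensors3.conf instead):

  wget ftp://ftp.tyan.com/softwave/lms/2932.sensors.conf
  cp 2932.sensors.conf /etc/sensors.conf
  sensors -s    # apply the 'set' statements from the new configuration
  sensors       # read temperatures, voltages and fan speeds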
[Beowulf] Powersave on Beowulf nodes
What about using powersaved (and the dbus and HAL daemons) on Beowulf nodes? Currently I have installed SuSE 10.3, where all the corresponding daemons run (by default) at runlevel 3. I simply added issuing powersave -f at the end of booting. /proc/acpi/thermal_zone/ is empty, and powersave can't give me temperature and fan information. I don't see any serious advantage in using the powersaved daemon in performance mode (using the performance scheme). We have many jobs in SGE at any moment, and an underload situation (where it would be reasonable to decrease the CPU frequency) is not a danger for us :-) So I'm thinking about simply stopping all the corresponding daemons. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
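If the nodes never idle, stopping the daemons is straightforward; a sketch using the SuSE 10.x init-script names (they may differ on other distributions, and dbus/HAL should only be disabled if nothing else on the node needs them):

  /etc/init.d/powersaved stop
  chkconfig powersaved off
  chkconfig haldaemon off
  chkconfig dbus off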
Re: [Beowulf] size of swap partition
In message from Mark Hahn [EMAIL PROTECTED] (Tue, 10 Jun 2008 00:58:12 -0400 (EDT)): ... for instance, you can always avoid OOM with the vm.overcommit_memory=2 sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap to get the desired limits.) in this mode, the kernel tracks how much VM it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo) and compares that to a commit limit that reflects ram and swap. if you don't use overcommit_memory=2, you are basically borrowing VM space in hopes of not needing it. that can still be reasonable, considering how often processes have a lot of shared VM, and how many processes allocate but never touch lots of pages. but you have to ask yourself: would I like a system that was actually _using_ 16 GB of swap? if you have 16x disks, perhaps, but 16G will suck if you only have 1 disk. at least for overcommit_memory != 2, I don't see the point of configuring a lot of swap, since the only time you'd use it is if you were thrashing. sort of a quality of life argument. But what are the recommendations of modern practice? it depends a lot on the size variance of your jobs, as well as their real/virtual ratio. the kernel only enforces RLIMIT_AS (vsz in ps), assuming a 2.6 kernel - I forget whether 2.4 did RLIMIT_RSS or not. if you use overcommit_memory=2, your desired max VM size determines the amount of swap. otherwise, go with something modest - memory size or so. but given that the smallest reasonable single disk these days is probably about 320GB, it's hard to justify being _too_ tight. :-) The disks we use in the nodes are SATA WD 10K RPM with 70 GB :-)) We didn't set overcommit_memory=2, but we do use a strongly restricted scheduling policy for SGE batch jobs, with only a few applications. We have only batch jobs (no interactive ones), moreover - practically only *long* batch jobs. As a result the total VM requested per node is equal to (or lower than) the RAM. There is practically zero swap activity. The only exception is (seldom executed) small test jobs, non-parallelized, mainly for checking input data. They use a small amount of RAM. So it looks to me that I can set the swap size even lower than 1.5*RAM (I think RAM + 4 GB = 20 GB will be enough). In message from Walid [EMAIL PROTECTED] (Tue, 10 Jun 2008 19:27:43 +0300): Hi, For an 8GB dual-socket quad-core node, choosing --recommended in the kickstart file instead of specifying a size, RHEL5 allocates 1GB. our developers say that they should not swap as this will cause an overhead, and they try to avoid it as much as possible OpenSuSE 10.3 recommends a swap size of 2 GB only, but I don't know whether the SuSE installation software performs some estimation based on the server RAM or not. Yours Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
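A sketch of the strict-accounting setup Mark describes, with the numbers from the node above (16 GB RAM, 20 GB swap, default overcommit_ratio of 50, giving CommitLimit = 20 GB + 0.5 * 16 GB = 28 GB):

  sysctl -w vm.overcommit_memory=2
  sysctl -w vm.overcommit_ratio=50
  grep -E 'CommitLimit|Committed_AS' /proc/meminfo
  # persist across reboots
  echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
  echo 'vm.overcommit_ratio = 50' >> /etc/sysctl.conf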
[Beowulf] size of swap partition
A long time ago a simple rule was formulated for the swap partition size (equal to the main memory size). Currently we all have relatively large RAM on the nodes (typically, I believe, 2 or more GB per core; we have 16 GB per dual-socket quad-core Opteron node). What is a typical modern swap size today? I understand that it depends on the applications ;-) We, in particular, practically don't have jobs which run out of RAM. For single-core dual-socket Opteron nodes with 4 GB RAM per node and a molecular modelling workload we used a 4 GB swap partition. But what are the recommendations of modern practice? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Barcelona hardware error: how to detect
How is it possible to detect whether a particular AMD Barcelona CPU has - or doesn't have - the known hardware erratum? To be more exact, is Rev. B2 of the Opteron 2350 the CPU stepping with the error or without it? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 11:57:28 -0400 (EDT)): To be more exact, is Rev. B2 of the Opteron 2350 the CPU stepping with the error or without it? AMD, like Intel, does a reasonable job of disclosing such info: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that the error will be fixed in the B3 revision. I heard that new CPUs without the TLB+L3 error are shipping now, but are these CPUs really B3, or perhaps some even newer revision? Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
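The revision can usually be read from /proc/cpuinfo without opening the box: as I understand the AMD revision guide, family 10h B2 parts report cpuid stepping 2 and B3 parts report stepping 3 (worth cross-checking against the guide's table for the exact model):

  grep -E 'cpu family|model|stepping' /proc/cpuinfo | sort -u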
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:30:57 -0400 (EDT)): http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that the error will be fixed in the B3 revision. I believe the absence of an 'x' in the B3 column of the table on p. 15 means that it _is_ fixed in B3. I have just received some preliminary data about Gaussian-03 run problems with B2 and about the absence of these problems with B3. Yours Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I believe the absence of an 'x' in the B3 column of the table on p. 15 means that it _is_ fixed in B3. I have just received some preliminary data about Gaussian-03 run problems with B2 and about the absence of these problems with B3. I'm mystified by this: B2 was broken, so using it without the BIOS workaround is just a mistake or masochism. the workaround _did_ apparently have performance implications, but that's why B3 exists... do you mean you know of G03 problems on B2 systems which are operating _with_ the workaround? I don't know exactly, but I think the crash happened in the absence of the workaround, because I was not informed of any kernel patches or BIOS changes. This was also interesting for me, because I have no information on how this hardware problem manifests itself in real life. Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Barcelona hardware error: how to detect
In message from Jason Clinton [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:16:33 -0500): On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky [EMAIL PROTECTED] wrote: In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I'm mystified by this: B2 was broken, so using it without the BIOS workaround is just a mistake or masochism. the workaround _did_ apparently have performance implications, but that's why B3 exists... do you mean you know of G03 problems on B2 systems which are operating _with_ the workaround? I don't know exactly, but I think the crash happened in the absence of the workaround, because I was not informed of any kernel patches or BIOS changes. This was also interesting for me, because I have no information on how this hardware problem manifests itself in real life. Mikhail The B2 BIOS workaround is to disable the L3 cache, which gives you a 10-20% performance hit with no reduction in power consumption. The kernel patch is very extensive and, last I heard, under NDA. AMD has said publicly that the patch gives you a 1-2% performance hit. This URL is old, but may give some information: https://www.x86-64.org/pipermail/discuss/2007-December/010260.html Mikhail ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
In message from Ricardo Reis [EMAIL PROTECTED] (Fri, 2 May 2008 14:05:25 +0100 (WEST)): Does anyone know if/when there will be double-precision floating point on those little toys from Nvidia? Next-generation Tesla, but I don't know when. Or use an AMD FireStream 9170 instead :-) Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] IB DDR: mvapich2 vs mvapich performance
In message from Eric Thibodeau [EMAIL PROTECTED] (Wed, 23 Apr 2008 16:48:04 -0400): Mikhail Kuzminsky wrote: In message from Greg Lindahl [EMAIL PROTECTED] (Wed, 23 Apr 2008 00:36:44 -0700): On Wed, Apr 23, 2008 at 07:04:51AM +0400, Mikhail Kuzminsky wrote: Is this throughput difference the result of the MPI-2 vs. MPI implementation, or should I believe that this difference (about 4% for my mvapich vs. mvapich2 at SC'07) is not significant - in the sense that it is simply due to some measurement errors (inaccuracies)? I dunno, does it help your real applications? Significantly - of course not :-) But our application is really bound by throughput! Throughput or latency? Throughput! osu_bw was interesting just for bandwidth ;-) Mikhail Yours Mikhail -- greg Eric ___ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
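For reference, the OSU bandwidth and latency micro-benchmarks are usually run between two nodes roughly like this (host names are placeholders; the launcher depends on how MVAPICH/MVAPICH2 was built - mpirun_rsh is the classic MVAPICH one):

  mpirun_rsh -np 2 node01 node02 ./osu_bw
  mpirun_rsh -np 2 node01 node02 ./osu_latency   # to separate bandwidth from latency effects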