Re: [Beowulf] immersion

2024-04-07 Thread Scott Atchley
On Sun, Mar 24, 2024 at 2:38 PM Michael DiDomenico 
wrote:

> i'm curious if others think DLC might hit a power limit sooner or later,
> like air cooling already has, given chips keep climbing in watts.
>

What I am worried about is power per blade/node. The Cray EX design used
in Frontier has a per-blade limit. Frontier and El Capitan have two nodes per
blade; Aurora, which uses more power, has only one node per blade. I
imagine that ORv3 racks will have similar issues.

We can remove 6 kW per blade today, and I am confident that we can remove
some more. That said, we could reach a point where a blade might hold just a
single processor.
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] immersion

2024-03-24 Thread Scott Atchley
On Sun, Mar 24, 2024 at 2:38 PM Michael DiDomenico 
wrote:

> thanks, there's some good info in there.  just to be clear to others that
> might chime in, i'm less interested in the immersion/dlc debate than in
> getting updates from people that have sat on either side of the fence.
> dlc's been around a while and so has immersion, but what i can't get from
> sales glossies is real-world maintenance over time.
>
> being in the DoD space, i'm well aware of the HPE stuff, but they're also
> what's making me look at other stuff.  i'm not real keen on 100+ kW racks;
> there are many safety concerns with that much amperage in a single
> cabinet.  not to mention all that custom hardware comes at a stiff cost and
> in my opinion doesn't have a good ROI if you're not buying 100s of racks'
> worth of it.  but your space-constrained issue is definitely one i'm
> familiar with.  our new space is smaller than i think we should build, but
> we're also geography constrained.
>
> the other info i'm seeking is futures; DLC seems like a right-now solution
> to ride the AI wave.  i'm curious if others think DLC might hit a power
> limit sooner or later, like air cooling already has, given chips keep
> climbing in watts.  and maybe it's not even a power limit per se, but DLC
> is pretty complicated with all the piping/manifolds/connectors/CDUs; does
> there come a point where it's just not worth it unless it's a big custom
> solution like the HPE stuff?
>

The ORv3 rack design's maximum power is the number of power shelves times
the power per shelf. Reach out to me directly at  @ ornl.gov
and I can connect you with some vendors.



Re: [Beowulf] immersion

2024-03-24 Thread Scott Atchley
On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico 
wrote:

> i'm curious to know
>
> 1 how many servers per vat or U
> 2 i saw a slide mention 1500w/sqft, can you break that number into kw per
> vat?
> 3 can you shed any light on the heat exchanger system? it looks like
> there's just two pipes coming into the vat, is that chilled water or oil?
> is there a CDU somewhere off camera?
> 4 that power bar in the middle is that DUG custom?
> 5 any stats on reliability?  like have you seen a decrease in the hw
> failures?
>
> are you selling the vats/tech as a product?  can i order one? :)
>
> since cpus are pushing 400w/chip, nvidia is teasing 1000w/chip coming in
> the near future, and i'm working on building a new site, i'm keenly
> interested in thoughts on DLC or immersion tech from anyone else too
>

As with all things in life, everything has trade-offs.

We have looked at immersion at ORNL and these are my thoughts:

*Immersion*

   - *Pros*
      - Low Power Usage Effectiveness (PUE) - as low as 1.03. This means that
      you spend only $0.03 on cooling for each $1.00 that the system consumes
      in power (a worked example follows the DLC list below). In contrast,
      air-cooled data centers can range from 1.30 to 1.60 or higher.
      - No special racks - you can install white-box servers and remove the
      fans.
      - No cooling loops - no fittings that can leak, get kinked, or be
      accidentally clamped off.
      - No bio-growth issues
   - *Cons*
      - Low power density - take a vertical rack and lay it sideways. DLC
      allows the same power density with the rack standing vertical.
      - Messy - depends on the fluid, but oil is common and cheap. Many
      centers build a crane to hoist out servers and then let them drip dry
      for a day before servicing.
      - High Mean-Time-To-Repair (MTTR) - unless you have two cranes, you
      cannot insert a new node until the old one has dripped dry and been
      removed from the crane.
      - Some solutions can be expensive and/or lead to part failures due to
      residue build-up on processor pins.

*Direct Liquid Cooling (DLC)*

   - *Pros*
      - Low PUE compared to air cooling. It depends on how much of the heat
      the water captures. Summit uses hybrid DLC (water for CPUs and GPUs;
      air for DIMMs, NICs, SSDs, and power supplies) with ~22°C water.
      Summit's PUE can range from 1.03 to 1.10 depending on the time of year.
      Frontier, on the other hand, is 100% DLC (no fans in the compute racks)
      with 32°C water. Frontier's PUE can range from 1.03 to 1.06 depending
      on the time of year. Both PUEs include the pumps for the cooling towers
      and for moving the water between the Central Energy Plant and the data
      center.
      - High power density - the HPE Cray EX 4000 "cabinet" can supply up to
      400 kW and occupies the space of two racks (i.e., 200 kW per standard
      rack). If your data center is space constrained, this is a crucial
      factor.
      - No mess - DLC systems with deionized (DI) water or propylene glycol
      water (PGW) use dripless connectors.
      - Low MTTR - remove a server and insert another if you have a spare.
   - *Cons*
      - Special racks - HPE cabinets are non-standard and require
      HPE-designed servers. This is changing; I saw many examples of ORv3
      racks at GTC that use the OCP standard with DLC manifolds.
      - Cooling loops - loops can leak at fittings or be kinked or crimped,
      which restricts flow and causes overheating. Hybrid loops are simpler,
      while 100% DLC loops are more complex (i.e., expensive). Servers tend
      to include drip sensors to detect leaks, but we have found that the
      DIMMs are better drip sensors (i.e., the drips hit them before finding
      the drip sensor).
      - Bio-growth
         - DI water includes biocides and you have to manage it. We have
         learned that no system can be bio-growth free (e.g., inserting a
         blade will recontaminate the system). That said, Summit has never
         had any biogrowth-induced overheating, and Frontier has gone close
         to nine months without overheating issues due to growth.
         - PGW systems should be immune to bio-growth, but you lose ~30% of
         the heat-removal capacity compared to DI water. Depending on your
         environment, you might be able to avoid trim water (i.e., mixing in
         chilled water to reduce the temperature).
      - Can be expensive to upgrade the facility (i.e., to install
      evaporative coolers, piping, pumps, etc.).
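
To put the PUE numbers in perspective, here is a minimal back-of-the-envelope
sketch in Python. The IT load and the electricity price are made-up assumptions
for illustration only; the PUE values are the ones quoted above.

# Rough cost of the cooling overhead implied by PUE. The 20 MW IT load and
# the $0.06/kWh electricity price are assumptions for illustration only.
def annual_cost_musd(mw, usd_per_kwh=0.06):
    return mw * 1000 * 8760 * usd_per_kwh / 1e6   # MW -> kW, 8760 hours/year

it_load_mw = 20.0
for label, pue in [("immersion or DLC", 1.03), ("air cooled", 1.45)]:
    overhead_mw = it_load_mw * (pue - 1.0)        # power spent beyond the IT load
    print(f"{label:16s} PUE={pue:.2f}  overhead={overhead_mw:5.2f} MW  "
          f"~${annual_cost_musd(overhead_mw):.2f}M/yr on cooling and losses")

With those assumed numbers, the difference between a 1.03 and a 1.45 PUE is
several million dollars per year in electricity.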

For ORNL, we are space constrained. For that alone, we prefer DLC over
immersion.
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] [External] anyone have modern interconnect metrics?

2024-01-22 Thread Scott Atchley
On Mon, Jan 22, 2024 at 11:16 AM Prentice Bisbal  wrote:

> 
>
>> > Another interesting topic is that nodes are becoming many-core - any
>> > thoughts?
>>
>> Core counts are getting too high to be of use in HPC. High core-count
>> processors sound great until you realize that all those cores are now
>> competing for the same memory bandwidth and network bandwidth, neither of
>> which increases with core count.
>>
>> Last April we were evaluating test systems from different vendors for a
>> cluster purchase. One of our test users does a lot of CFD simulations
>> that are very sensitive to memory bandwidth. While he was getting a 50%
>> speed-up on AMD compared to Intel (which makes sense, since the AMDs require
>> 12 DIMM slots to be filled instead of Intel's 8), he asked us to consider
>> servers with FEWER cores. Even with the AMDs, he was saturating the
>> memory bandwidth before scaling to all the cores, causing his
>> performance to plateau. Buying cheaper processors with lower
>> core counts was better for him, since the savings would allow us to buy
>> additional nodes, which would be more beneficial to him.
>>
>
> We see this as well in DOE especially when GPUs are doing a significant
> amount of the work.
>
> Yeah, I noticed that Frontier and Aurora will actually be single-socket
> systems w/ "only" 64 cores.
>
Yes, Frontier has a *single CPU socket* and *four GPUs* (actually eight
GPUs from the user's perspective). It works out to eight CPU cores per
Graphics Compute Die (GCD). The FLOPS ratio between the CPU and the GPUs is
roughly 1:100.

Note that Aurora has dual CPUs and six GPUs. I am not sure whether the user
sees six or more GPUs. The Aurora node is similar to our Summit node but with
more connectivity between the GPUs.
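
For a rough sense of where Frontier's 1:100 figure comes from, here is a small
sketch; the per-device FP64 peaks below are approximate round numbers recalled
from memory, not values taken from this thread, so treat them as assumptions.

# Approximate CPU:GPU FP64 ratio for a Frontier node (assumed round numbers).
cpu_cores = 64
cpu_peak_tf = 2.0        # assumed ~2 TF FP64 for the 64-core CPU
gcds_per_node = 8        # 4 GPUs, each with 2 Graphics Compute Dies (GCDs)
gcd_peak_tf = 24.0       # assumed ~24 TF FP64 vector per GCD

gpu_peak_tf = gcds_per_node * gcd_peak_tf
print(f"CPU cores per GCD : {cpu_cores // gcds_per_node}")      # 8, as noted above
print(f"CPU:GPU FLOPS     : 1:{gpu_peak_tf / cpu_peak_tf:.0f}")  # on the order of 1:100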
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] [External] anyone have modern interconnect metrics?

2024-01-20 Thread Scott Atchley
On Fri, Jan 19, 2024 at 9:40 PM Prentice Bisbal via Beowulf <
beowulf@beowulf.org> wrote:

> > Yes, someone is sure to say "don't try characterizing all that stuff -
> > it's your application's performance that matters!"  Alas, we're a generic
> > "any kind of research computing" organization, so there are thousands
> > of apps
> > across all possible domains.
>
> 
>
> I agree with you. I've always hated the "it depends on your application"
> stock response in HPC. I think it's BS. Very few of us work in an
> environment where we support only a handful of applications with very
> similar characteristics. I say use standardized benchmarks that test
> specific performance metrics (mem bandwidth or mem latency, etc.),
> first, and then use a few applications to confirm what you're seeing
> with those benchmarks.
>
> 
>

It does depend on the application(s). At OLCF, we have hundreds of
applications. Some pound the network and some do not. Because we are a
Leadership Computing Facility, a user cannot get any time on the machine
unless they can scale to 20% and ideally to 100% of the system. We have
several apps with FFTs, which become all-to-alls in MPI. Because of this,
ideally we want a non-blocking fat-tree (i.e., Clos) topology. Every other
topology is a compromise. That said, a full Clos is 2x or more in cost
compared to other common topologies (e.g., dragonfly or a 2:1 oversubscribed
fat-tree). If your workload is small jobs that can fit in a
rack, for example, then by all means save some money and get an
oversubscribed fat-tree, dragonfly, etc. If your jobs need to use the full
machine and they have large message collectives, then you have to bite the
bullet and spend more on network and less on compute and/or storage.
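
To make the cost trade-off concrete, here is a small sketch that counts
switches for a two-level fat-tree built from fixed-radix switches, non-blocking
versus 2:1 oversubscribed. The switch radix and node count are assumptions for
illustration; the point is simply that cutting the up-links shrinks the spine
layer and its cables.

# Switch counting for a two-level (leaf-spine) fat-tree. Radix and node count
# are assumed values for illustration only.
import math

def two_level_fat_tree(nodes, radix=64, oversubscription=1):
    # Split each leaf's ports between down-links (to nodes) and up-links
    # (to spines) according to the down:up oversubscription ratio.
    up = radix // (1 + oversubscription)
    down = radix - up
    leaves = math.ceil(nodes / down)
    spines = math.ceil(leaves * up / radix)
    return leaves, spines

for ratio in (1, 2):
    leaves, spines = two_level_fat_tree(4096, radix=64, oversubscription=ratio)
    print(f"{ratio}:1 fat-tree, 4096 nodes: {leaves} leaf + {spines} spine "
          f"= {leaves + spines} switches")

With these assumed numbers, the non-blocking network needs roughly 50% more
switches and twice the spine switches and inter-switch cables, which is where
much of the cost difference comes from.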

To assess the usage of our parallel file systems, we run with Darshan
installed, and it captures data from each MPI job (each job step within a
job). We do not have similar tools to determine how the network is being
used (e.g., how much bandwidth we need, what the communication patterns are).
When I was at Myricom and we were releasing Myri-10G, I benchmarked several
ISV codes on 2G versus 10G. If I remember correctly, Fluent did not benefit
from the extra bandwidth, but PowerFlow benefited a lot.

My point is that "It depends" may not be a satisfying answer, but it is
realistic.


> > Another interesting topic is that nodes are becoming many-core - any
> > thoughts?
>
> Core counts are getting too high to be of use in HPC. High core-count
> processors sound great until you realize that all those cores are now
> competing for the same memory bandwidth and network bandwidth, neither of
> which increases with core count.
>
> Last April we were evaluating test systems from different vendors for a
> cluster purchase. One of our test users does a lot of CFD simulations
> that are very sensitive to memory bandwidth. While he was getting a 50%
> speed-up on AMD compared to Intel (which makes sense, since the AMDs require
> 12 DIMM slots to be filled instead of Intel's 8), he asked us to consider
> servers with FEWER cores. Even with the AMDs, he was saturating the
> memory bandwidth before scaling to all the cores, causing his
> performance to plateau. Buying cheaper processors with lower
> core counts was better for him, since the savings would allow us to buy
> additional nodes, which would be more beneficial to him.
>

We see this as well in DOE especially when GPUs are doing a significant
amount of the work.

Scott


> 
> --
> Prentice
>
>
> On 1/16/24 5:19 PM, Mark Hahn wrote:
> > Hi all,
> > Just wondering if any of you have numbers (or experience) with
> > modern high-speed COTS ethernet.
> >
> > Latency mainly, but perhaps also message rate.  Also ease of use
> > with open-source products like OpenMPI, maybe Lustre?
> > Flexibility in configuring clusters in the >= 1k node range?
> >
> > We have a good idea of what to expect from Infiniband offerings,
> > and are familiar with scalable network topologies.
> > But vendors seem to think that high-end ethernet (100-400Gb) is
> > competitive...
> >
> > For instance, here's an excellent study of Cray/HP Slingshot (non-COTS):
> > https://arxiv.org/pdf/2008.08886.pdf
> > (half rtt around 2 us, but this paper has great stuff about
> > congestion, etc)
> >
> > Yes, someone is sure to say "don't try characterizing all that stuff -
> > it's your application's performance that matters!"  Alas, we're a generic
> > "any kind of research computing" organization, so there are thousands
> > of apps
> > across all possible domains.
> >
> > Another interesting topic is that nodes are becoming many-core - any
> > thoughts?
> >
> > Alternatively, are there other places to ask? Reddit or something less
> > "greybeard"?
> >
> > thanks, mark hahn
> > McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada
> >
> > PS: the snarky name "NVidiband" just occurred to me; too soon?
> > ___
> > Beowulf mailing list, Beowulf@beowulf.org sponsored by 

Re: [Beowulf] [EXTERNAL] Re: anyone have modern interconnect metrics?

2024-01-18 Thread Scott Atchley
There is a lot of interest in lower-cost optics, but manufacturing costs
for the alternatives to today's active optical cables have not delivered the
promised savings. Silicon photonics always seems to be just a few years away,
much as fusion is always a decade away.

On Wed, Jan 17, 2024 at 6:16 PM Lux, Jim (US 3370) via Beowulf <
beowulf@beowulf.org> wrote:

> To a certain extent, faster Ethernet is more likely to be a commodity –
> and at rates above 1 Gbps, there’s substantial “art” in making a PHY that
> works reliably.   At the 10G speed, there’s things like RapidIO and SRIO,
> but they
>
>1. Only work for short distances (<<1 meter)
>2. Are **very** board layout and other implementation sensitive.  Fine
>for getting in and out of a package, but not great for running any 
> distance.
>
> Then there’s XAUI (pronounced Zowie!) which is a multiwire wire interface
> between logic and 10G (or whatever) PHY.   But it’s got the same problems
> as SRIO/RapidIO (or for that matter, the venerable (now) TLK2711 SERDES).
>
>
>
> 10G and 40G Ethernet do actually work over distances of meters, and over
> some moderate range of temperatures, and are likely to meet EMI/EMC
> requirements.
>
>
>
> It is interesting that there doesn’t seem to be the same commercial
> pressure for optical versions. They all exist, but typically as modules
> you’d slide into your switch, not components you’d solder to a board.  And
> there are plenty of XAUI->optical kinds of interfaces.  And optical cables
> are cheap and relatively rugged.
>
>
>
>
>
> *From:* Beowulf  *On Behalf Of *Scott Atchley
> *Sent:* Wednesday, January 17, 2024 7:18 AM
> *To:* Larry Stewart 
> *Cc:* Mark Hahn ; beowulf@beowulf.org
> *Subject:* [EXTERNAL] Re: [Beowulf] anyone have modern interconnect
> metrics?
>
>
>
> While I was at Myricom, the founder, Chuck Seitz, used to say that there
> was Ethernet and Ethernot. He tied Myricom's fate to Ethernet's 10G PHYs.
>
>
>
> On Wed, Jan 17, 2024 at 9:08 AM Larry Stewart  wrote:
>
> I don't know what the networking technology of the future will be like,
> but it will be called Ethernet.
> - unknown (to me)
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] anyone have modern interconnect metrics?

2024-01-17 Thread Scott Atchley
While I was at Myricom, the founder, Chuck Seitz, used to say that there
was Ethernet and Ethernot. He tied Myricom's fate to Ethernet's 10G PHYs.

On Wed, Jan 17, 2024 at 9:08 AM Larry Stewart  wrote:

> I don't know what the networking technology of the future will be like,
> but it will be called Ethernet.
> - unknown (to me)
>
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] anyone have modern interconnect metrics?

2024-01-17 Thread Scott Atchley
I don't think that Ultra Ethernet (UE) networks are available yet.

On Wed, Jan 17, 2024 at 3:13 AM Jan Wender via Beowulf 
wrote:

> Hi Mark, hi all,
>
> The limitations of Ethernet seem to be recognised by many participants in
> the network area. That is the reason for the founding of the Ultra-Ethernet
> alliance: https://ultraethernet.org/ Maybe on that site you could find
> more information?
> (Disclosure: I work for HPE, one of the founding members, but am not
> involved directly with this).
>
> Cheerio, Jan
> —
> Jan Wender - j.wen...@web.de - Signal/Mobile: +4915780949428 - Threema
> EPD4T5B4
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Checkpointing MPI applications

2023-03-27 Thread Scott Atchley
On Thu, Mar 23, 2023 at 3:46 PM Christopher Samuel 
wrote:

> On 2/19/23 10:26 am, Scott Atchley wrote:
>
> > We are looking at SCR for Frontier with the idea that users can store
> > checkpoints on the node-local drives with replication to a buddy node.
> > SCR will manage migrating non-defensive checkpoints to Lustre.
>
> Interesting, does it really need local storage or can it be used with
> diskless systems via tricks with loopback filesystems, etc?


Yes, it only needs a mount path. It can be ramfs/tmpfs, xfs (or other local
file system), etc.

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Checkpointing MPI applications

2023-02-19 Thread Scott Atchley
Hi Chris,

It looks like it tries to checkpoint application state without
checkpointing the application or its libraries (including MPI). I am
curious whether the checkpoint sizes are similar to, or significantly larger
than, the application's typical outputs/checkpoints. If they are much larger,
the time to write will be higher and they will stress capacity more.

We are looking at SCR for Frontier with the idea that users can store
checkpoints on the node-local drives with replication to a buddy node. SCR
will manage migrating non-defensive checkpoints to Lustre.

Scott

On Sat, Feb 18, 2023 at 3:43 PM Christopher Samuel 
wrote:

> Hi all,
>
> The list has been very quiet recently, so as I just posted something to
> the Slurm list in reply to the topic of checkpointing MPI applications I
> thought it might interest a few of you here (apologies if you've already
> seen it there).
>
> If you're looking to try checkpointing MPI applications you may want to
> experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
> for the DMTCP C/R effort here:
>
> https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed on
> Cori (our older Cray system) for people to experiment with. The
> documentation for it may be useful to others who'd like to try it out -
> it's got a nice description of how it works too which even I, as a
> non-programmer, can understand.
>
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give advice
> on system related issues.
>
> I'm curious if there are other methods that people are using out there
> for transparent checkpointing of MPI applications?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Top 5 reasons why mailing lists are better than Twitter

2022-11-21 Thread Scott Atchley
We have OpenMPI running on Frontier with libfabric. We are using HPE's CXI
(Cray eXascale Interface) provider instead of RoCE though.

On Sat, Nov 19, 2022 at 2:57 AM Matthew Wallis via Beowulf <
beowulf@beowulf.org> wrote:

>
>
> ;-)
>
> 1. Less spam.
> 2. Private DMs, just email the person.
> 3. Long form.
> 4. No social media plagues of likes, dislikes, ratios.
> 5. Identity verification via gpg
>
> Not necessarily a comprehensive list, and there’s definitely something to
> be said for twitter’s bitesized content making it easy to consume a stream
> of consciousness from around the world, both positive, and negative.
>
> On topic, if anyone has experience using libfabric with OpenMPI I’d be
> interested to hear. I need to do a little more poking, but trying to get
> RoCE on anything other than Mellanox still seems to be painful.
>
> Matt.
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] likwid vs stream (after HPCG discussion)

2022-03-20 Thread Scott Atchley
On Sat, Mar 19, 2022 at 6:29 AM Mikhail Kuzminsky  wrote:

> If so, it turns out that for the HPC user, STREAM gives a more
> important estimate - the application is translated by the compiler
> (they do not write in assembler, except for modules from mathematical
> libraries), and STREAM will give a realistic estimate of what the
> application will actually achieve.
>

When vendors advertise STREAM results, they compile the benchmark with
non-temporal loads and stores, which means that all memory accesses bypass
the processor's caches. If your application of interest does a random walk
through memory and has neither temporal nor spatial locality, then
using non-temporal loads and stores makes sense and STREAM is irrelevant.

If you want to know what memory bandwidth that your application may
achieve, you can use STREAM without the compiler flags to enable
non-temporal loads and stores.
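
If you just want a quick look at the bandwidth an application-like access
pattern achieves through the normal cached path (i.e., without non-temporal
stores), a numpy triad in the spirit of STREAM is easy to run. This is only a
rough sketch, not a replacement for the real STREAM benchmark, and the array
size is an assumption chosen to be much larger than cache.

# Rough STREAM-style triad (a = b + s*c) with numpy; uses the normal cached
# load/store path, unlike vendor builds that use non-temporal stores.
import numpy as np
import time

n = 50_000_000                      # ~400 MB per array, far larger than cache
b = np.random.rand(n)
c = np.random.rand(n)
s = 3.0

best = float("inf")
for _ in range(5):                  # take the best of a few repetitions
    t0 = time.perf_counter()
    a = b + s * c                   # triad: two reads and one write per element
    best = min(best, time.perf_counter() - t0)

bytes_moved = 3 * n * 8             # STREAM convention: 3 arrays of 8-byte doubles
print(f"triad bandwidth ~ {bytes_moved / best / 1e9:.1f} GB/s")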
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Data Destruction

2021-09-29 Thread Scott Atchley
For our users that have sensitive data, we keep it encrypted at rest and in
transit.

For HDD-based systems, you can perform a secure erase per NIST standards.
For SSD-based systems, the extra writes from a secure erase contribute to
wear on the drives and possibly to their eventual wear-out. Most SSDs
provide an option to mark blocks as zero without having to write the zeroes.
I do not think that this capability is exposed up to the PFS layer (Lustre,
GPFS, Ceph, NFS); it is only available at the ext4 or XFS layer.
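
As an illustration of that "mark blocks as zero" capability, the block-device
discard path on Linux can deallocate an entire SSD without rewriting it, e.g.
via util-linux's blkdiscard. This is a hedged sketch only: whether a discard
qualifies as a NIST 800-88 purge depends on the device and your policy, and the
device path below is hypothetical.

# Illustration only: ask an SSD to deallocate (TRIM/discard) every block.
# This avoids the extra writes of an overwrite-based erase, but it is NOT by
# itself a certified NIST 800-88 purge; check the device and your policy.
import subprocess

def discard_whole_device(dev):
    # blkdiscard (util-linux) issues a discard for the whole block device.
    subprocess.run(["blkdiscard", dev], check=True)

if __name__ == "__main__":
    discard_whole_device("/dev/nvme0n1")   # hypothetical device path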

On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon  wrote:

> The former.  We are curious how to selectively delete data from a parallel
> filesystem.  For example we commonly use Lustre, ceph, and Isilon in our
> environment.  That said if other types allow for easier destruction of
> selective data we would be interested in hearing about it.
>
> -Paul Edmon-
> On 9/29/2021 10:06 AM, Scott Atchley wrote:
>
> Are you asking about selectively deleting data from a parallel file system
> (PFS) or destroying drives after removal from the system either due to
> failure or system decommissioning?
>
> For the latter, DOE does not allow us to send any non-volatile media
> offsite once it has had user data on it. When we are done with drives, we
> have a very big shredder.
>
> On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf <
> beowulf@beowulf.org> wrote:
>
>> Occassionally we get DUA (Data Use Agreement) requests for sensitive
>> data that require data destruction (e.g. NIST 800-88). We've been
>> struggling with how to handle this in an era of distributed filesystems
>> and disks.  We were curious how other people handle requests like this?
>> What types of filesystems to people generally use for this and how do
>> people ensure destruction?  Do these types of DUA's preclude certain
>> storage technologies from consideration or are there creative ways to
>> comply using more common scalable filesystems?
>>
>> Thanks in advance for the info.
>>
>> -Paul Edmon-
>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Data Destruction

2021-09-29 Thread Scott Atchley
Are you asking about selectively deleting data from a parallel file system
(PFS) or destroying drives after removal from the system either due to
failure or system decommissioning?

For the latter, DOE does not allow us to send any non-volatile media
offsite once it has had user data on it. When we are done with drives, we
have a very big shredder.

On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf 
wrote:

> Occassionally we get DUA (Data Use Agreement) requests for sensitive
> data that require data destruction (e.g. NIST 800-88). We've been
> struggling with how to handle this in an era of distributed filesystems
> and disks.  We were curious how other people handle requests like this?
> What types of filesystems to people generally use for this and how do
> people ensure destruction?  Do these types of DUA's preclude certain
> storage technologies from consideration or are there creative ways to
> comply using more common scalable filesystems?
>
> Thanks in advance for the info.
>
> -Paul Edmon-
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] AMD and AVX512

2021-06-16 Thread Scott Atchley
On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf <
beowulf@beowulf.org> wrote:

> Did anyone else attend this webinar panel discussion with AMD hosted by
> HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your
> Success in HPC"
>
> https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/
>
> I attended it, and noticed there was no mention of AMD supporting
> AVX512, so during the question and answer portion of the program, I
> asked when AMD processors will support AVX512. The answer given, and I'm
> not making this up, is that AMD listens to their users and gives the
> users what they want, and right now they're not hearing any demand for
> AVX512.
>
> Personally, I call BS on that one. I can't imagine anyone in the HPC
> community saying "we'd like processors that offer only 1/2 the floating
> point performance of Intel processors". Sure, AMD can offer more cores,
> but with only AVX2, you'd need twice as many cores as Intel processors,
> all other things being equal.
>
> Last fall I evaluated potential new cluster nodes for a large cluster
> purchase using the HPL benchmark. I compared a server with dual AMD EPYC
> 7H12 processors (128 cores) to a server with quad Intel Xeon 8268
> processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and
> only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only
> 64% of the Xeon 8268 system's, despite the AMD system having 33% more cores.
>
>  From what I've heard, the AMD processors run much hotter than the Intel
> processors, too, so I imagine a FLOPS/Watt comparison would be even less
> favorable to AMD.
>
> An argument can be made that calculations that lend themselves to
> vectorization should be done on GPUs instead of the main processors, but
> the last time I checked, GPU jobs are still memory limited, and
> moving data in and out of GPU memory can still take time, so I can see
> situations where, for large amounts of data, using CPUs would be preferred
> over GPUs.
>
> Your thoughts?
>
> --
> Prentice
>

AMD has studied this quite a bit in DOE's FastForward-2 and PathForward. I
think Carlos' comment is on track. Having a unit that cannot be fed data
quickly enough is pointless. It is application dependent. If your working set
fits in cache, then the vector units work well. If not, you have to move
data, which stalls compute pipelines. NERSC saw only a 10% increase in
performance when moving from low-core-count Xeon CPUs with AVX2 to Knights
Landing with many cores and AVX-512, when it should have seen an order-of-magnitude
increase. Although Knights Landing had MCDRAM (Micron's not-quite
HBM), other constraints limited performance (e.g., lack of enough memory
references in flight, coherence traffic).

Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than Xeon
with AVX-512 (or Knights Landing) because of the High Bandwidth Memory
(HBM) attached and I assume a larger number of memory references in flight.
The downside is the lack of memory capacity (only 32 GB per node). This
shows that it is possible to get more performance with a CPU with a 512b
vector engine. That said, it is not clear that even this CPU design can
extract the most from the memory bandwidth. If you look at the increase in
memory bandwidth from Summit to Fugaku, one would expect performance on
real apps to increase by that amount as well. From the presentations that I
have seen, that is not always the case. For some apps, the GPU
architecture, with its coherence on demand rather than with every
operation, can extract more performance.

AMD will add 512b vectors if/when it makes sense on real apps.
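
A simple roofline-style estimate makes the "cannot be fed data quickly enough"
point concrete. The peak and bandwidth numbers below are round assumptions for
illustration, not measurements of any particular part.

# Roofline sketch: attainable FLOPS = min(peak, arithmetic intensity * bandwidth).
# Peaks and bandwidth are assumed round numbers for illustration only.
def attainable_gflops(peak_gflops, bw_gbs, ai_flops_per_byte):
    return min(peak_gflops, ai_flops_per_byte * bw_gbs)

peak_narrow, peak_wide = 1500.0, 3000.0   # assumed per-socket FP64 peaks (GFLOPS)
bandwidth_gbs = 200.0                     # assumed memory bandwidth (GB/s)

for ai in (0.125, 1.0, 50.0):             # STREAM-like, stencil-like, cache-resident
    a_narrow = attainable_gflops(peak_narrow, bandwidth_gbs, ai)
    a_wide = attainable_gflops(peak_wide, bandwidth_gbs, ai)
    print(f"AI={ai:6.3f} flop/byte  narrow={a_narrow:7.1f}  wide={a_wide:7.1f}  "
          f"gain={a_wide/a_narrow:.2f}x")

Unless the arithmetic intensity is high (the working set effectively lives in
cache), doubling the vector width buys essentially nothing, which matches the
Knights Landing experience described above.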
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Project Heron at the Sanger Institute [EXT]

2021-02-04 Thread Scott Atchley
On Thu, Feb 4, 2021 at 9:23 AM Jörg Saßmannshausen <
sassy-w...@sassy.formativ.net> wrote:

> One of the things I heard a few times is the use of GPUs for the analysis.
> Is
> that something you are doing as well?


ORNL definitely is. We were the first to contribute cycles to the COVID-19
HPC Consortium:

https://www.olcf.ornl.gov/2020/03/30/summit-joins-the-covid-19-high-performance-computing-consortium/

Two finalists for the SC20 Gordon Bell Special Prize for COVID-19 research used
Summit:

https://www.olcf.ornl.gov/2020/11/18/two-finalists-nominated-for-gordon-bell-special-prize-for-covid-19-work-on-summit/

One of which won the prize:

https://www.olcf.ornl.gov/2020/11/18/multi-institutional-team-earns-gordon-bell-special-prize-finalist-nomination-for-rapid-covid-19-molecular-docking-simulations/


> Also, on the topic of GPUs (and being a
> bit controversial): are there actually any programs out there which are
> not
> using nVidia GPUs and use the AMD ones for example?


LLNL is with their Corona system:

https://www.llnl.gov/news/upgrades-llnl-supercomputer-amd-penguin-computing-aid-covid-19-research
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Julia on POWER9?

2020-10-16 Thread Scott Atchley
% hostname -f
login1.summit.olcf.ornl.gov

% module avail |& grep julia
   forge/19.0.4   ibm-wml-ce/1.6.1-1   julia/1.4.2 (E)   ppt/2.4.0-beta2 (D)   vampir/9.5.0 (D)

[atchley@login1]~ % module avail julia

------------------- /sw/summit/modulefiles/core -------------------

   julia/1.4.2 (E)

  Where:
   E:  Experimental



On Thu, Oct 15, 2020 at 5:02 PM Prentice Bisbal via Beowulf <
beowulf@beowulf.org> wrote:

> So while you've all been discussing Julia, etc., I've been trying to
> build and get it running on POWER9 for a cluster of AC922 nodes (same as
> Summit, but with 4 GPUs per node). After doing a combination of Google
> searching and soul-searching, I was able to get a functional version of
> Julia to build for POWER9. However, I'm not 100% sure my build is fully
> functional, as when I did 'make testall' some of the tests failed.
>
> Is there anyone on this list using or supporting the latest version of
> Julia, 1.5.2, on POWER9? If so, I'd like to compare notes. I imagine
> someone from OLCF is on this list.
>
> Based on my Internet searching, as of August 2019 Julia was being used
> on Summit on thousands of cores, but I've also seen posts from the Julia
> devs saying they can't support the POWER architecture anymore because
> they no longer have access to POWER hardware. Most of this information
> comes from the Julia GitHub or Julia Discourse conversations.
>
> --
> Prentice
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Best case performance of HPL on EPYC 7742 processor ...

2020-08-17 Thread Scott Atchley
I do not have any specific HPL hints.

I would suggest setting the BIOS NUMAs-Per-Socket option to 4 (NPS-4). I would
try running 16 processes, one per CCX (two per CCD), with an OpenMP depth
of 4.

Dell's HPC blog has a few articles on tuning Rome:

https://www.dell.com/support/article/en-us/sln319015/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance
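
As a sanity check on the numbers being discussed, the 4.608 TFLOPS nominal peak
and the 80%/90% targets work out as below. This is a sketch; the 2.25 GHz base
clock and 16 FP64 FLOPs per cycle per core (AVX2, two 256-bit FMA units) are
quoted from memory rather than from this thread.

# Nominal HPL peak for a dual-socket EPYC 7742 node and the efficiency targets.
sockets, cores_per_socket, base_ghz = 2, 64, 2.25
flops_per_cycle = 16                      # FP64 with AVX2: 2 FMA units x 4 lanes x 2

peak_tflops = sockets * cores_per_socket * base_ghz * flops_per_cycle / 1000
print(f"nominal peak : {peak_tflops:.3f} TFLOPS")       # ~4.608, as quoted above
for eff in (0.80, 0.90):
    print(f"{eff:.0%} of peak  : {peak_tflops * eff:.2f} TFLOPS")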

Scott


On Fri, Aug 14, 2020 at 5:30 PM Richard Walsh  wrote:

>
> All,
>
> What have people achieved on this SKU on a single-node using the stock
> HPL 2.3 source... ??
>
> I have seen a variety of performance claims even as high as 90% of its
> nominal
> per node peak of 4.608 TFLOPs.  I can now get above 80% of peak, but not
> higher.
> I have heard that to get higher values special BIOS settings are required,
> including turning off SMT, which allows the chip to turbo higher.  Remember
> this is not the 7542 processor with 32 cores per chip and the same bandwidth
> per socket as the 7742, which can turbo to over 100% of nominal peak for HPL.
>
> If people have gotten higher single node numbers ... what is your recipe
> ... ??
>
> I am particularly interested in BIOS settings, and maybe surprise settings
> in the HPL.dat file.  Do higher performing runs require using close to the
> maximum memory on the node ... ??  As this is single-node, I would not
> expect choice of MPI to make a difference
>
> To get to 80% with SMT on in the BIOS, I am building with an older Intel
> compiler and MKL that still recognizes MKL_DEBUG_CPU_TYPE=5.
> Running so that the number of MPI ranks on the node matches the
> number of CCXs seems to give the best numbers.
>
> Following the tuning instructions from AMD for using BLIS and GCC for
> the build does not get me there.
>
> Thanks,
>
> Richard Walsh
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Power per area

2020-03-11 Thread Scott Atchley
Hi Stu,

The rolling weight is only an issue when moving equipment during
installation/removal. It causes point loads and we typically lay steel
plate down to spread the load over multiple tiles.

What is your performance density (FLOPS/ft^2) in Houston if you do not mind
me asking?

Scott

On Tue, Mar 10, 2020 at 8:30 PM Stu Midgley  wrote:

> Immersion cooling makes a lot of sense :)
>
> We run it on the 21st floor of a building in Kuala Lumpur, on the 1st
> floor in Perth and on a slab-on-ground in Houston.
>
> The tanks+fluid are light.  When full of equipment, about 1.2 tonnes
> spread over about 2m2.
>
> In Houston, we have the lowest rated raised floor - since the tanks spread
> the load across multiple tiles/floor stands, and there is no rolling weight
> (its spread evenly over 2m2)
>
> So ~ 2200lbs/10sqft ie. about 220lbs/sqft .  We run a power density of
> 8.5kW/sqm (~800W/sqft) across our whole DC (which includes all the internal
> white space/CRAC space etc).
>
> We cool the whole facility with evaporation (compressor cooling is only
> for comfort cooling).
>
> We have hit a PUE of 1.045 in Houston...   and 1.035 in Perth :)
>
> Come and have a look at our Houston DC :)
>
>
>
> On Wed, Mar 11, 2020 at 3:37 AM Scott Atchley 
> wrote:
>
>> Hi everyone,
>>
>> I am wondering whether immersion cooling makes sense. We are most limited
>> by datacenter floor space. We can manage to bring in more power (up to 40
>> MW for Frontier) and install more cooling towers (ditto), but we cannot
>> simply add datacenter space. We have asked to build a new building and the
>> answer has been consistently "No."
>>
>> Summit is mostly water cooled. Each node has cold plates on the CPUs and
>> GPUs. Fans are needed to cool the memory and power supplies, and that heat is
>> captured by rear-door heat exchangers. It occupies roughly 5,600 ft^2. With 200 PF
>> of performance and 14 MW of power, that is 36 TF/ft^2 and 2.5 kW/ft^2.
>>
>> I am wondering what the comparable performance and power is per square
>> foot for the densest, deployed (not theoretical) immersion cooled systems.
>> Any ideas?
>>
>> To make the exercise even more fun, what is the weight per square foot
>> for immersion systems? Our data centers have a limit of 250 or 500
>> pounds per square foot. I expect immersion systems to need higher loadings
>> than that.
>>
>> Thanks,
>>
>> Scott
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>
>
> --
> Dr Stuart Midgley
> sdm...@gmail.com
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Power per area

2020-03-10 Thread Scott Atchley
Thanks for the offer.

This is an academic exercise for now. Our budgets are committed to through
2026 for Frontier. 

On Tue, Mar 10, 2020 at 4:11 PM Jeff Johnson 
wrote:

> Scott,
>
> They are about to release a 85kW version of the rack, same dimensions. Let
> me know if you want me to connect you with their founder/inventor.
>
> --Jeff
>
> On Tue, Mar 10, 2020 at 1:08 PM Scott Atchley 
> wrote:
>
>> Hi Jeff,
>>
>> Interesting, I have not seen this yet.
>>
>> Looking at their 52 kW rack's dimensions, it works out to 3.7 kW/ft^2 for
>> the enclosure if we do not count the row pitch. If we add 4-5 feet for row
>> pitch, then it drops to 2.2-2.4 kW/ft^2. Assuming Summit's IBM AC922 nodes
>> fit and again a row pitch of 4-5 feet, the performance per area would be
>> 31-34 TF/ft^2. Both the performance per area and the power per area are
>> close to Summit's. Their PUE (1.15-1.2) is higher than we get on Summit (1.05
>> for 9 months and 1.1-1.2 for 3 months). It is very interesting for data
>> centers that have widely varying loads for adjacent cabinets.
>>
>> Scott
>>
>> On Tue, Mar 10, 2020 at 3:47 PM Jeff Johnson <
>> jeff.john...@aeoncomputing.com> wrote:
>>
>>> Scott,
>>>
>>> It's not immersion but it's a different approach to the conventional
>>> rack cooling approach. It's really cool (literally and figuratively).
>>> They're based here in San Diego.
>>>
>>> https://ddcontrol.com/
>>>
>>> --Jeff
>>>
>>> On Tue, Mar 10, 2020 at 12:37 PM Scott Atchley <
>>> e.scott.atch...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I am wondering whether immersion cooling makes sense. We are most
>>>> limited by datacenter floor space. We can manage to bring in more power (up
>>>> to 40 MW for Frontier) and install more cooling towers (ditto), but we
>>>> cannot simply add datacenter space. We have asked to build a new building and
>>>> the answer has been consistently "No."
>>>>
>>>> Summit is mostly water cooled. Each node has cold plates on the CPUs
>>>> and GPUs. Fans are needed to cool the memory and power supplies, and that
>>>> heat is captured by rear-door heat exchangers. It occupies roughly 5,600 ft^2.
>>>> With 200 PF of performance and 14 MW of power, that is 36 TF/ft^2 and 2.5
>>>> kW/ft^2.
>>>>
>>>> I am wondering what the comparable performance and power is per square
>>>> foot for the densest, deployed (not theoretical) immersion cooled systems.
>>>> Any ideas?
>>>>
>>>> To make the exercise even more fun, what is the weight per square foot
>>>> for immersion systems? Our data centers have a limit of 250 or 500
>>>> pounds per square foot. I expect immersion systems to need higher loadings
>>>> than that.
>>>>
>>>> Thanks,
>>>>
>>>> Scott
>>>> ___
>>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
>>>> Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>>>
>>>
>>>
>>> --
>>> --
>>> Jeff Johnson
>>> Co-Founder
>>> Aeon Computing
>>>
>>> jeff.john...@aeoncomputing.com
>>> www.aeoncomputing.com
>>> t: 858-412-3810 x1001   f: 858-412-3845
>>> m: 619-204-9061
>>>
>>> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
>>>
>>> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>>>
>>
>
> --
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Power per area

2020-03-10 Thread Scott Atchley
Summit uses both cold plates and rear-door heat exchangers. Frontier will
use Cray's Shasta cabinet, which is entirely water cooled (no fans at all) and
very high density. I am thinking of the systems after Frontier. Can we
continue to use cold plates, or does immersion at some point become the only
alternative? I am not familiar enough with today's immersion offerings, but I
am not sure that they are any denser or more efficient than Summit is. I look
at pictures of cryptocurrency facilities with immersion tanks that are 3-4
feet high, and I see a lot of empty space above them that could be holding
more compute.

On Tue, Mar 10, 2020 at 7:01 PM Jörg Saßmannshausen <
sassy-w...@sassy.formativ.net> wrote:

> Dear all,
>
> yes, there is a max load of the floor and you really should stick to that,
> even though that might open a hole, pardon, a door for a new data centre.
> :-)
>
> There are various way of getting more cooling done. You can use doors
> which
> are cooled, the already described plates on the CPU, you can basically use
> a
> large trough and put your nodes in there (the trough is filled with oil,
> heat
> it up enough and you can fry your chips (not)), you can do that smaller as
> Iceotop demonstrated:
> https://www.iceotope.com/
>
> I guess there are a number of ways you can address the problem. Multi-core
> CPUs, like the new AMD ones, might also be a solution as you can get more
> cores per area.
>
> I hope that helps a bit.
>
> Jörg
>
> Am Dienstag, 10. März 2020, 20:26:18 GMT schrieb David Mathog:
> > On Tue, 10 Mar 2020 15:36:42 -0400 Scott Atchley wrote:
> > > To make the exercise even more fun, what is the weight per square foot
> > > for
> > > immersion systems? Our data centers have a limit of 250 or 500 pounds
> > > per
> > > square foot.
> >
> > I am not an architect but...
> >
> > Aren't there two load values for a floor?  The one I think you are
> > citing is the amount of weight which can safely be placed in a "small"
> > floor area without punching through or causing other localized damage,
> > the other is the total weight that can be placed on that floor without
> > the building collapsing.  If the whole data center is on the ground
> > floor sitting right on a concrete slab with no voids beneath it I would
> > expect the latter value to be huge and not a real concern, but it might
> > be less than (500 pounds per square foot) X (total area) on the 2nd or
> > higher floors.
> >
> > Regards,
> >
> > David Mathog
> > mat...@caltech.edu
> > Manager, Sequence Analysis Facility, Biology Division, Caltech
> > ___
> > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Power per area

2020-03-10 Thread Scott Atchley
On Tue, Mar 10, 2020 at 4:26 PM David Mathog  wrote:

> On Tue, 10 Mar 2020 15:36:42 -0400 Scott Atchley wrote:
>
> > To make the exercise even more fun, what is the weight per square foot
> > for
> > immersion systems? Our data centers have a limit of 250 or 500 pounds
> > per
> > square foot.
>
> I am not an architect but...
>
> Aren't there two load values for a floor?  The one I think you are
> citing is the amount of weight which can safely be placed in a "small"
> floor area without punching through or causing other localized damage,
> the other is the total weight that can be placed on that floor without
> the building collapsing.  If the whole data center is on the ground
> floor sitting right on a concrete slab with no voids beneath it I would
> expect the latter value to be huge and not a real concern, but it might
> be less than (500 pounds per square foot) X (total area) on the 2nd or
> higher floors.
>

There is the static load and the rolling load. I was quoting the static load
for our current raised floors.

I agree that a ground-floor slab should be effectively unlimited, and it is
unclear what our second-floor concrete-over-steel floor can hold.
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Power per area

2020-03-10 Thread Scott Atchley
Hi Jeff,

Interesting, I have not seen this yet.

Looking at their 52 kW rack's dimensions, it works out to 3.7 kW/ft^2 for
the enclosure if we do not count the row pitch. If we add 4-5 feet for row
pitch, then it drops to 2.2-2.4 kW/ft^2. Assuming Summit's IBM AC922 nodes
fit and again a row pitch of 4-5 feet, the performance per area would be
31-34 TF/ft^2. Both the performance per area and the power per area are
close to Summit's. Their PUE (1.15-1.2) is higher than we get on Summit (1.05
for 9 months and 1.1-1.2 for 3 months). It is very interesting for data
centers that have widely varying loads for adjacent cabinets.

Scott

On Tue, Mar 10, 2020 at 3:47 PM Jeff Johnson 
wrote:

> Scott,
>
> It's not immersion but it's a different approach to the conventional rack
> cooling approach. It's really cool (literally and figuratively). They're
> based here in San Diego.
>
> https://ddcontrol.com/
>
> --Jeff
>
> On Tue, Mar 10, 2020 at 12:37 PM Scott Atchley 
> wrote:
>
>> Hi everyone,
>>
>> I am wondering whether immersion cooling makes sense. We are most limited
>> by datacenter floor space. We can manage to bring in more power (up to 40
>> MW for Frontier) and install more cooling towers (ditto), but we cannot
>> simply add datacenter space. We have asked to build a new building and the
>> answer has been consistently "No."
>>
>> Summit is mostly water cooled. Each node has cold plates on the CPUs and
>> GPUs. Fans are needed to cool the memory and power supplies, and that heat is
>> captured by rear-door heat exchangers. It occupies roughly 5,600 ft^2. With 200 PF
>> of performance and 14 MW of power, that is 36 TF/ft^2 and 2.5 kW/ft^2.
>>
>> I am wondering what the comparable performance and power is per square
>> foot for the densest, deployed (not theoretical) immersion cooled systems.
>> Any ideas?
>>
>> To make the exercise even more fun, what is the weight per square foot
>> for immersion systems? Our data centers have a limit of 250 or 500
>> pounds per square foot. I expect immersion systems to need higher loadings
>> than that.
>>
>> Thanks,
>>
>> Scott
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>
>
> --
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


[Beowulf] Power per area

2020-03-10 Thread Scott Atchley
Hi everyone,

I am wondering whether immersion cooling makes sense. We are most limited
by datacenter floor space. We can manage to bring in more power (up to 40
MW for Frontier) and install more cooling towers (ditto), but we cannot
simply add datacenter space. We have asked to build a new building and the
answer has been consistently "No."

Summit is mostly water cooled. Each node has cold plates on the CPUs and
GPUs. Fans are needed to cool the memory and power supplies, and that heat is
captured by rear-door heat exchangers. It occupies roughly 5,600 ft^2. With 200 PF
of performance and 14 MW of power, that is 36 TF/ft^2 and 2.5 kW/ft^2.
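
For reference, the two densities above fall straight out of the system totals;
a quick check:

# Summit's performance and power density from the totals quoted above.
perf_pf, power_mw, area_ft2 = 200, 14, 5600

print(f"performance density : {perf_pf * 1000 / area_ft2:.0f} TF/ft^2")   # ~36
print(f"power density       : {power_mw * 1000 / area_ft2:.1f} kW/ft^2")  # ~2.5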

I am wondering what the comparable performance and power is per square foot
for the densest, deployed (not theoretical) immersion cooled systems. Any
ideas?

To make the exercise even more fun, what is the weight per square foot for
immersion systems? Our data centers have a limit of 250 or 500 pounds per
square foot. I expect immersion systems to need higher loadings than that.

Thanks,

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Interactive vs batch, and schedulers

2020-01-17 Thread Scott Atchley
Hi Jim,

While we allow both batch and interactive jobs, the scheduler handles them the
same. The scheduler uses queue time, node count, requested wall time,
project ID, and other factors to determine when jobs run. We have backfill
turned on, so that while the scheduler reserves nodes for a large job and waits
for them to drain, it schedules smaller jobs in that footprint as long as their
requested wall time would end before the last node becomes available. We
also have a preemptable queue that can run in the backfill window.
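
In rough pseudocode, the backfill rule amounts to the following (a simplified
sketch, not the actual scheduler code):

# Simplified backfill test: while nodes drain for a pending large job, a small
# job may run on the freed nodes only if it finishes before the large job's
# reserved start time.
def can_backfill(walltime_s, now_s, large_job_start_s, free_nodes, needed_nodes):
    fits_in_time = now_s + walltime_s <= large_job_start_s
    fits_in_space = needed_nodes <= free_nodes
    return fits_in_time and fits_in_space

# Example: a 10-minute, 4-node job with the big job due to start in 30 minutes.
print(can_backfill(600, now_s=0, large_job_start_s=1800,
                   free_nodes=8, needed_nodes=4))   # True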

While not addressing your concern directly, we found that scheduling large
and small jobs slightly differently makes a difference. The scheduler
typically has a list that enumerates the nodes. We changed the scheduler to
use the list as usual for large jobs but changed it to use the list in
reverse so that small jobs are placed at the "end" of the list. The paper is
"A multi-faceted approach to job placement for improved performance on
extreme-scale systems". When we started seeing GPU failures and we replaced
half the GPUs, we modified the scheduler's list to schedule large, GPU jobs on
the new GPUs and small jobs and CPU-only jobs on the nodes with old GPUs. That
paper is "GPU age-aware scheduling to improve the reliability of leadership
jobs on Titan". You might be able to
modify these techniques to help your situation.
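In pseudo-Python, the dual-ended idea looks something like this (a sketch only;
the actual change was made inside the scheduler's node-ordering logic, and the
16-node cutoff below is just the threshold we happened to use on Titan):

    def pick_nodes(free_nodes, nodes_needed, small_threshold=16):
        ordered = sorted(free_nodes)
        if nodes_needed > small_threshold:
            return ordered[:nodes_needed]     # large jobs: front of the list
        return ordered[-nodes_needed:]        # small jobs: back of the list

    print(pick_nodes(range(100), 32))   # nodes 0-31 (front of the list)
    print(pick_nodes(range(100), 4))    # nodes 96-99 (back of the list)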

Scott

On Thu, Jan 16, 2020 at 6:25 PM Lux, Jim (US 337K) via Beowulf <
beowulf@beowulf.org> wrote:

> Are there any references out there that discuss the tradeoffs between
> interactive and batch scheduling (perhaps some from the 60s and 70s?) –
>
> Most big HPC systems have a mix of giant jobs and smaller ones managed by
> some process like PBS or SLURM, with queues of various sized jobs.
>
>
>
> What I’m interested in is the idea of jobs that, if spread across many
> nodes (dozens) can complete in seconds (<1 minute) providing essentially
> “interactive” access, in the context of large jobs taking days to
> complete.   It’s not clear to me that the current schedulers can actually
> do this – rather, they allocate M of N nodes to a particular job pulled out
> of a series of queues, and that job “owns” the nodes until it completes.
> Smaller jobs get run on (M-1) of the N nodes, and presumably complete
> faster, so it works down through the queue quicker, but ultimately, if you
> have a job that would take, say, 10 seconds on 1000 nodes, it’s going to
> take 20 minutes on 10 nodes.
>
>
>
> Jim
>
>
>
>
>
> --
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] HPC demo

2020-01-14 Thread Scott Atchley
Yes, we have built a few of them. We have one here, one at AMSE, and one
that travels to schools in one of our traveling science trailers.

On Tue, Jan 14, 2020 at 10:29 AM John McCulloch 
wrote:

> Hey Scott, I think I saw an exhibit like what you’re describing at the
> AMSE when I was on a project in Oak Ridge. Was that it?
>
>
>
> John McCulloch | PCPC Direct, Ltd. | desk 713-344-0923
>
>
>
> *From:* Scott Atchley 
> *Sent:* Tuesday, January 14, 2020 7:19 AM
> *To:* John McCulloch 
> *Cc:* beowulf@beowulf.org
> *Subject:* Re: [Beowulf] HPC demo
>
>
>
> We still have Tiny Titan <https://tinytitan.github.io> even though Titan
> is gone. It allows users to toggle processors on and off and the display
> has a mode where the "water" is color coded by the processor, which has a
> corresponding light. You can see the frame rate go up as you add processors
> and the motion becomes much more fluid.
>
>
>
> On Mon, Jan 13, 2020 at 7:35 PM John McCulloch 
> wrote:
>
> I recently inherited management of a cluster and my knowledge is limited
> to a bit of Red Hat. I need to figure out a demo for upper management
> graphically demonstrating the speed up of running a parallel app on one x86
> node versus multiple nodes up to 36. They have dual Gold 6132 procs and
> Mellanox EDR interconnect. Any suggestions would be appreciated.
>
>
>
> Respectfully,
>
> John McCulloch | PCPC Direct, Ltd.
>
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] HPC demo

2020-01-14 Thread Scott Atchley
We still have Tiny Titan <https://tinytitan.github.io> even though Titan is
gone. It allows users to toggle processors on and off and the display has a
mode where the "water" is colored coded by the processor, which has a
corresponding light. You can see the frame rate go up as you add processors
and the motion becomes much more fluid.
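If you do not have an application handy, a minimal scaling demo is easy to put
together. Here is a sketch assuming MPI and the mpi4py package are available on
the cluster (the sample count is arbitrary; tune it so a single-node run takes a
minute or so), then run it with increasing node counts and compare the times:

    # speedup_demo.py -- fixed-size Monte Carlo estimate of pi; the wall time
    # should drop as you add ranks/nodes (strong scaling)
    from mpi4py import MPI
    import random

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    total_samples = 100_000_000          # fixed total work (arbitrary)
    my_samples = total_samples // size   # each rank takes an equal share

    t0 = MPI.Wtime()
    hits = sum(random.random()**2 + random.random()**2 <= 1.0
               for _ in range(my_samples))
    total_hits = comm.reduce(hits, op=MPI.SUM, root=0)
    t1 = MPI.Wtime()

    if rank == 0:
        print(f"ranks={size}  pi~={4.0 * total_hits / (my_samples * size):.5f}  "
              f"time={t1 - t0:.2f}s")

Launch it under the scheduler with, e.g., mpirun -np <ranks> python
speedup_demo.py on one node and then on 36 nodes; the falling wall time for the
same amount of work is the speedup you want to show.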

On Mon, Jan 13, 2020 at 7:35 PM John McCulloch  wrote:

> I recently inherited management of a cluster and my knowledge is limited
> to a bit of Red Hat. I need to figure out a demo for upper management
> graphically demonstrating the speed up of running a parallel app on one x86
> node versus multiple nodes up to 36. They have dual Gold 6132 procs and
> Mellanox EDR interconnect. Any suggestions would be appreciated.
>
>
>
> Respectfully,
>
> John McCulloch | PCPC Direct, Ltd.
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] traverse @ princeton

2019-10-11 Thread Scott Atchley
Okay, that is the same slot Summit/Sierra use for the EDR HCA. You may want
to check out our paper at SC19 where we look at several new features in EDR
as well as how to best stripe data over the four virtual ports.

On Thu, Oct 10, 2019 at 1:49 PM Bill Wichser  wrote:

> Actually 12 per rack.  The reasoning was that there were 2 connections
> per host to top of rack switch leaving 12 uplinks to two tier0 switches
> at 6 each.
>
> For the IB cards they are some special flavored Mellanox which attach to
> the PCIv4 sockets, 8 lanes each.  And since 8 lanes of v4 == 16 lanes of
> v3, we get full EDR to both CPU sockets.
>
> Bill
>
> On 10/10/19 12:57 PM, Scott Atchley wrote:
> > That is better than 80% peak, nice.
> >
> > Is it three racks of 15 nodes? Or two racks of 18 and 9 in the third
> rack?
> >
> > You went with a single-port HCA per socket and not the shared, dual-port
> > HCA in the shared PCIe slot?
> >
> > On Thu, Oct 10, 2019 at 8:48 AM Bill Wichser  > <mailto:b...@princeton.edu>> wrote:
> >
> > Thanks for the kind words.  Yes, we installed more like a mini-Sierra
> > machine which is air cooled.  There are 46 nodes of the IBM AC922,
> two
> > socket, 4 V100 where each socket uses the SMT threading x4.  So two
> 16
> > core chips, 32/node, 128 threads per node.  The GPUs all use NVLink.
> >
> > There are two EDR connections per host, each tied to a CPU, 1:1 per
> > rack
> > of 12 and 2:1 between racks.  We have a 2P scratch filesystem running
> > GPFS.  Each node also has a 3T NVMe card as well for local scratch.
> >
> > And we're running Slurm as our scheduler.
> >
> > We'll see if it makes the top500 in November.  It fits there today
> but
> > who knows what else got on there since June.  With the help of
> > nVidia we
> > managed to get 1.09PF across 45 nodes.
> >
> > Bill
> >
> > On 10/10/19 7:45 AM, Michael Di Domenico wrote:
> >  > for those that may not have seen
> >  >
> >  >
> >
> https://insidehpc.com/2019/10/traverse-supercomputer-to-accelerate-fusion-research-at-princeton/
> >  >
> >  > Bill Wischer and Prentice Bisbal are frequent contributors to the
> >  > list, Congrats on the acquisition.  Its nice to see more HPC
> > expansion
> >  > in our otherwise barren hometown... :)
> >  >
> >  > Maybe one of them will pass along some detail on the machine...
> >  > ___
> >  > Beowulf mailing list, Beowulf@beowulf.org
> > <mailto:Beowulf@beowulf.org> sponsored by Penguin Computing
> >  > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> >  >
> > ___
> > Beowulf mailing list, Beowulf@beowulf.org
> > <mailto:Beowulf@beowulf.org> sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> >
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] traverse @ princeton

2019-10-10 Thread Scott Atchley
That is better than 80% peak, nice.

Is it three racks of 15 nodes? Or two racks of 18 and 9 in the third rack?

You went with a single-port HCA per socket and not the shared, dual-port
HCA in the shared PCIe slot?

On Thu, Oct 10, 2019 at 8:48 AM Bill Wichser  wrote:

> Thanks for the kind words.  Yes, we installed more like a mini-Sierra
> machine which is air cooled.  There are 46 nodes of the IBM AC922, two
> socket, 4 V100 where each socket uses the SMT threading x4.  So two 16
> core chips, 32/node, 128 threads per node.  The GPUs all use NVLink.
>
> There are two EDR connections per host, each tied to a CPU, 1:1 per rack
> of 12 and 2:1 between racks.  We have a 2P scratch filesystem running
> GPFS.  Each node also has a 3T NVMe card as well for local scratch.
>
> And we're running Slurm as our scheduler.
>
> We'll see if it makes the top500 in November.  It fits there today but
> who knows what else got on there since June.  With the help of nVidia we
> managed to get 1.09PF across 45 nodes.
>
> Bill
>
> On 10/10/19 7:45 AM, Michael Di Domenico wrote:
> > for those that may not have seen
> >
> >
> https://insidehpc.com/2019/10/traverse-supercomputer-to-accelerate-fusion-research-at-princeton/
> >
> > Bill Wischer and Prentice Bisbal are frequent contributors to the
> > list, Congrats on the acquisition.  Its nice to see more HPC expansion
> > in our otherwise barren hometown... :)
> >
> > Maybe one of them will pass along some detail on the machine...
> > ___
> > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> >
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


[Beowulf] Exascale Day (10/18 aka 10^18)

2019-10-04 Thread Scott Atchley
Cray is hosting an online panel with speakers from ANL, LLNL, ORNL, ECP,
and Cray on Oct. 18:

https://www.cray.com/resources/exascale-day-panel-discussion
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] [EXTERNAL] Re: HPE completes Cray acquisition

2019-09-27 Thread Scott Atchley
Cray: This one goes up to 10^18

On Fri, Sep 27, 2019 at 12:08 PM Christopher Samuel 
wrote:

> On 9/27/19 7:40 AM, Lux, Jim (US 337K) via Beowulf wrote:
>
> > “A HPE company” seems sort of bloodless and corporate.  I would kind of
> > hope for  something like “CRAY – How Fast Do You Want to Go?” or
> > something like that to echo back to their long history of “just make it
> > fast”
>
> "Cray: this one goes up to 11"
>
> --
>Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] HPE completes Cray acquisition

2019-09-25 Thread Scott Atchley
These companies complement each other. HPE is working on some very cool
technologies and their purchasing power should help reduce costs. Cray has
experience with the leadership-scale systems, several generations of HPC
interconnects, and optimizing scientific software.

We are waiting to find out which logo(s) will be on the Frontier when it
lands.

On Wed, Sep 25, 2019 at 12:09 PM Christopher Samuel 
wrote:

> Cray joins SGI as part of the HPE stable:
>
>
> https://www.hpe.com/us/en/newsroom/press-release/2019/09/hpe-completes-acquisition-of-supercomputing-leader-cray-inc.html
>
>  > As part of the acquisition, Cray president and CEO Peter Ungaro, will
> join HPE as head of the HPC and AI business unit in Hybrid IT.
>
> All the best,
> Chris
> --
>Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Titan is no more

2019-08-05 Thread Scott Atchley
Thanks for the link. I took my family in to get a photo in front of Titan
at 9 am knowing that I would miss the shutdown.

On Mon, Aug 5, 2019 at 12:12 PM Alex Chekholko  wrote:

> No need to imagine; here is the video of the shutdown:
>
> https://www.linkedin.com/posts/buddy-bland-b6b1b0111_titan-ornl-olcf-activity-6563128357154762753-ce3m
>
>
> On Sun, Aug 4, 2019 at 1:19 PM Scott Atchley 
> wrote:
>
>> Hi everyone,
>>
>> Titan completed its last job Friday and was powered down at 1 pm. I
>> imagine the room was a lot quieter after that.
>>
>> Once Titan and the other systems in the room are removed, work will begin
>> on putting in the new, stronger floor that will hold Frontier.
>>
>> Scott
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


[Beowulf] Titan is no more

2019-08-04 Thread Scott Atchley
Hi everyone,

Titan completed its last job Friday and was powered down at 1 pm. I imagine
the room was a lot quieter after that.

Once Titan and the other systems in the room are removed, work will begin
on putting in the new, stronger floor that will hold Frontier.

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] HPE to acquire Cray

2019-05-20 Thread Scott Atchley
Geez, I take one day of vacation and this happens. My phone was lit up all
day.

On Fri, May 17, 2019 at 1:20 AM Kilian Cavalotti <
kilian.cavalotti.w...@gmail.com> wrote:

>
> https://www.bloomberg.com/news/articles/2019-05-17/hp-enterprise-said-to-near-deal-to-buy-supercomputer-maker-cray-jvrfiu79
>
> Cheers,
> --
> Kilian
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] LFortran ... a REPL/Compiler for Fortran

2019-03-25 Thread Scott Atchley
Hmm, how does this compare to Flang?

On Sun, Mar 24, 2019 at 12:33 PM Joe Landman  wrote:

> See https://docs.lfortran.org/ .   Figured Jeff Layton would like this :D
>
>
> --
> Joe Landman
> e: joe.land...@gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


Re: [Beowulf] Large amounts of data to store and process

2019-03-13 Thread Scott Atchley
I agree with your take about slower progress on the hardware front and that
software has to improve. DOE funds several vendors to do research to
improve technologies that will hopefully benefit HPC, in particular, as
well as the general market. I am reviewing a vendor's latest report on
micro-architectural techniques to improve performance (e.g., lower latency,
increase bandwidth). For this study, they use a combination of DOE
mini-apps/proxies as well as commercial benchmarks. The techniques that
this vendor investigated showed potential improvements for commercial
benchmarks but much less, if any, for the DOE apps, which are highly
optimized.

I will state that I know nothing about Julia, but I assume it is a
higher-level language than C/C++ (or Fortran for numerical codes). I am
skeptical that a higher-level language (assuming Julia is) can help. I
believe the vendor's techniques that I am reviewing benefited commercial
benchmarks because they are less optimized than the DOE apps. Using a
high-level language relies on the language's compiler/interpreter and
runtime. The developer has no idea what is happening or does not have the
ability to improve it if profiling shows that the issue is in the runtime.
I believe that if you need more performance, you will have to work for it
in a lower-level language and there is no more free lunch (i.e., hoping the
latest hardware will do it for me).

Hope I am wrong.


On Wed, Mar 13, 2019 at 5:23 PM Douglas Eadline 
wrote:

>
> I realize it is bad form to reply to one's own post and
> I forgot to mention something.
>
> Basically the HW performance parade is getting harder
> to celebrate. Clock frequencies have been slowly
> increasing while cores are multiplying rather quickly.
> Single core performance boosts are mostly coming
> from accelerators. Add to that the fact that speculation
> technology, when managed for security, slows things down.
>
> What this means is that the focus on software performance
> and optimization is going to increase because we can't just
> buy new hardware and improve things anymore.
>
> I believe languages like Julia can help with this situation.
> For a while.
>
> --
> Doug
>
> >> Hi All,
> >> Basically I have sat down with my colleague and we have opted to go down
> > the route of Julia with JuliaDB for this project. But here is an
> > interesting thought that I have been pondering if Julia is an up and
> > coming fast language to work with for large amounts of data how will
> > that
> >> affect HPC and the way it is currently used and HPC systems created?
> >
> >
> > First, IMO good choice.
> >
> > Second a short list of actual conversations.
> >
> > 1) "This code is written in Fortran." I have been met with
> > puzzling looks when I say the the word "Fortran." Then it
> > comes, "... ancient language, why not port to modern ..."
> > If you are asking that question young Padawan you have
> > much to learn, maybe try web pages"
> >
> > 2) I'll just use Python because it works on my Laptop.
> > Later, "It will just run faster on a cluster, right?"
> > and "My little Python program is now kind-of big and has
> > become slow, should I use TensorFlow?"
> >
> > 3) 
> > "Dammit Jim, I don't want to learn/write Fortran,C,C++ and MPI.
> > I'm a (fill in  domain specific scientific/technical position)"
> > 
> >
> > My reply,"I agree and wish there was a better answer to that question.
> > The computing industry has made great strides in HW with
> > multi-core, clusters etc. Software tools have always lagged
> > hardware. In the case of HPC it is a slow process and
> > in HPC the whole programming "thing" is not as "easy" as
> > it is in other sectors, warp drives and transporters
> > take a little extra effort.
> >
> > 4) Then I suggest Julia, "I invite you to try Julia. It is
> > easy to get started, fast, and can grow with you application."
> > Then I might say, "In a way it is HPC BASIC, if you are old
> > enough you will understand what I mean by that."
> >
> > The question with languages like Julia (or Chapel, etc) is:
> >
> >   "How much performance are you willing to give up for convenience?"
> >
> > The goal is to keep the programmer close to the problem at hand
> > and away from the nuances of the underlying hardware. Obviously
> > the more performance needed, the closer you need to get to the hardware.
> > This decision goes beyond software tools, there are all kinds
> > of cost/benefits that need to be considered. And, then there
> > is IO ...
> >
> > --
> > Doug
> >
> >
> >
> >
> >
> >
> >
> >> Regards,
> >> Jonathan
> >> -Original Message-
> >> From: Beowulf  On Behalf Of Michael Di
> > Domenico
> >> Sent: 04 March 2019 17:39
> >> Cc: Beowulf Mailing List 
> >> Subject: Re: [Beowulf] Large amounts of data to store and process On
> > Mon, Mar 4, 2019 at 8:18 AM Jonathan Aquilina
> > 
> >> wrote:
> >>> As previously mentioned we don’t really need to have anything
> >>> indexed
> > so I am thinking flat files are the way to go my only 

Re: [Beowulf] Introduction and question

2019-02-23 Thread Scott Atchley
Yes, you belong. :-)

On Sat, Feb 23, 2019 at 9:41 AM Will Dennis  wrote:

> Hi folks,
>
>
>
> I thought I’d give a brief introduction, and see if this list is a good
> fit for my questions that I have about my HPC-“ish” infrastructure...
>
>
>
> I am a ~30yr sysadmin (“jack-of-all-trades” type), completely self-taught
> (B.A. is in English, that’s why I’m a sysadmin :-P) and have ended up
> working at an industrial research lab for a large multi-national IT company
> (http://www.nec-labs.com). In our lab we have many research groups (as
> detailed on the aforementioned website) and a few of them are now using
> “HPC” technologies like Slurm, and I’ve become the lead admin for these
> groups. Having no prior background in this realm, I’m learning as fast as I
> can go :)
>
>
>
> Our “clusters” are collections of 5-30 servers, all collections bought
> over years and therefore heterogeneous hardware, all with locally-installed
> OS (i.e. not trad head-node with PXE-booted diskless minions) which is as
> carefully controlled as I can make it via standard OS install via Cobbler
> templates, and then further configured via config management (we use
> Ansible.) Networking is basic 10GbE between nodes (we do have Infiniband
> availability on one cluster, but it's fallen into disuse now since the
> project that has required it has ended.) Storage is one or more traditional
> NFS servers (some use ZFS, some not.) We have within the past few years
> adopted Slurm WLM for a job-scheduling system on top of these collections,
> and now are up to three different Slurm clusters, with I believe a fourth
> on the way.
>
>
>
> My first question for this list is basically “do I belong here?” I feel
> there’s a lot of HPC concepts it would be good for me to learn, so as I can
> improve the various research group’s computing environments, but not sure
> if this list is for much larger “true HPC” environments, or would be a good
> fit for a “HPC n00b” like me...
>
>
>
> Thanks for reading, and let me know your opinions :)
>
>
>
> Best,
>
> Will
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Simulation for clusters performance

2019-01-04 Thread Scott Atchley
You may also want to look at Sandia's Structural Simulation Toolkit and
Argonne's CODES.

On Thu, Jan 3, 2019 at 6:26 PM Benson Muite  wrote:

> There are a number of tools. A possible starting point is:
>
> http://spcl.inf.ethz.ch/Research/Scalable_Networking/SlimFly/
>
> Regards,
>
> Benson
> On 1/4/19 12:44 AM, Alexandre Ferreira Ramos wrote:
>
> Hi Eveybody,
>
> Happy new year!
>
> I need to conduct a study on simulating the performance of a large scale
> cluster with different network topology.
>
> Does anyone knows a simulation tool for verifying the performance of a
> large scale cluster of a given network topology?
>
> Thanks in advance for your time,
> Alexandre
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] If you can help ...

2018-11-09 Thread Scott Atchley
Done and I reposted your request on LinkedIn as well.

On Fri, Nov 9, 2018 at 8:28 AM Douglas Eadline  wrote:

>
> Everyone:
>
> This is a difficult email to write. For years we (Lara Kisielewska,
> Tim Wilcox, Don Becker, myself, and many others) have organized
> and staffed the Beowulf Bash each Monday night at SC.
>
> The event has always been funded by the various vendors who
> are part of the Beowulf ecosystem. This year, for what ever reason,
> we have come up short on sponsors. Contracts have been signed and
> we have cut back where possible. We still have a non-trivial shortage
> of around $10K (which the organizers have to come up with)
>
> So we are asking if anyone in the community (or your company
> or organization) can help, please visit our Go Fund Me page
> and kick in a few bucks. We even put together a few ClusterMonkey.net
> premiums. Thanks!
>
>   https://www.gofundme.com/beobash-2018
>
> TO BE CLEAR: The Beowulf Bash is still happening thanks to
> generous sponsors listed on our invite page, please come
> and enjoy yourself. Here is the invite page:
>
>   https://beowulfbash.com/
>
> See you there!
>
> --
> Doug
>
> --
> MailScanner: Clean
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] If I were specifying a new custer...

2018-10-11 Thread Scott Atchley
What do your apps need?

• Lots of memory?

Perhaps Power9 or Naples with 8 memory channels? Also, Cavium ThunderX2.

• More memory bandwidth?

Same as above.

• Max single thread performance?

Intel or Power9?

• Are your apps GPU enabled? If not, do you have budget/time/expertise to
do the work?

If not, then stick with CPUs.

• Need lots of PCIe for accelerators, SSDs, NICs?

Naples.

• Interconnect?

If Intel, either of the above. Does OPA work with Naples (I don't know)? If
Power/ARM and possibly Naples, then IB.

Sorry to give the default computer science answer of "It depends."

Scott

On Thu, Oct 11, 2018 at 3:09 PM Douglas Eadline 
wrote:

> All:
>
> Over the last several months I have been reading about:
>
> 1) Spectre/meltdown
> 2) Intel Fab issues
> 3) Supermicro MB issues
>
> I started thinking, if I were going to specify a
> single rack cluster, what would I use?
>
> I'm assuming a general HPC workload (not deep learning or
> analytics) I need to choose Xeon/Epyc, IB/Omni,
> Lustre/Ceph/BeGFS, should all nodes have GPU?
>
> I'm interested what members of this list think?
>
>
> --
> Doug
>
> --
> MailScanner: Clean
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] New Spectre attacks - no software mitigation - what impact for HPC?

2018-07-17 Thread Scott Atchley
I saw that article as well. It seems like they are targeting using RISC-V
to build an accelerator. One could argue that you do not need speculation
within a GPU-like accelerator, but you have to get your performance from
very wide execution units with lots of memory requests in flight as a GPU
does today.

On Tue, Jul 17, 2018 at 8:19 AM, John Hearns via Beowulf <
beowulf@beowulf.org> wrote:

> This article is well worth a read, on European Exascale projects
>
> https://www.theregister.co.uk/2018/07/17/europes_exascale_
> supercomputer_chips/
>
> The automotive market seems to have got mixed in there also!
> The main thrust  dual ARM based and RISC-V
>
> Also I like the plexiglass air shroud pictured at Barcelona. I saw
> something similar at the HPE centre in Grenoble.
> Damn good idea.
>
>
>
>
>
>
>
> On 17 July 2018 at 13:07, Scott Atchley  wrote:
>
>> Hi Chris,
>>
>> They say that no announced silicon is vulnerable. Your link makes it
>> clear that no ISA is immune if the implementation performs speculative
>> execution. I think your point about two lines of production may make sense.
>> Vendors will have to assess vulnerabilities and the performance trade-off.
>>
>> Personally, I do not see a large HPC system being built out of
>> non-speculative hardware. You would need much more hardware to reach a
>> level of performance and the additional power could lead to a lower
>> performance per Watt (i.e., exceed the facility's power budget).
>>
>> Scott
>>
>> On Tue, Jul 17, 2018 at 2:33 AM, Chris Samuel  wrote:
>>
>>> On Tuesday, 17 July 2018 11:08:42 AM AEST Chris Samuel wrote:
>>>
>>> > Currently these new vulnerabilities are demonstrated on Intel & ARM,
>>> it will
>>> > be interesting to see if AMD is also vulnerable (I would guess so).
>>>
>>> Interestingly RISC-V claims immunity, and that looks like it'll be one
>>> of the
>>> two CPU architectures blessed by the Europeans in their Exascale project
>>> (along with ARM).
>>>
>>> https://riscv.org/2018/01/more-secure-world-risc-v-isa/
>>>
>>> All the best,
>>> Chris
>>> --
>>>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>>>
>>> ___
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-10 Thread Scott Atchley
On Sun, Jun 10, 2018 at 4:53 AM, Chris Samuel  wrote:

> On Sunday, 10 June 2018 1:22:07 AM AEST Scott Atchley wrote:
>
> > Hi Chris,
>
> Hey Scott,
>
> > We have looked at this _a_ _lot_ on Titan:
> >
> > A Multi-faceted Approach to Job Placement for Improved Performance on
> > Extreme-Scale Systems
> >
> > https://ieeexplore.ieee.org/document/7877165/
>
> Thanks! IEEE has it paywalled but it turns out ACM members can read it
> here:
>
> https://dl.acm.org/citation.cfm?id=3015021
>
> > This issue we have is small jobs "inside" large jobs interfering with the
> > larger jobs. The item that is easy to implement with our scheduler was
> > "Dual-Ended Scheduling". We set a threshold of 16 nodes to demarcate
> small.
> > Jobs using more than 16 nodes, schedule from the top/front of the list
> and
> > smaller schedule from the bottom/back of the list.
>
> I'm guessing for "list" you mean a list of nodes?


Yes. It may be specific to Cray/Moab.


>   It's an interesting idea
> and possibly something that might be doable in Slurm with some patching,
> for
> us it might be more like allocate sub-node jobs from the start of the list
> (to
> hopefully fill up holes left by other small jobs) and full node jobs from
> the
> end of the list (where here list is a set of nodes of the same weight).
>
> You've got me thinking... ;-)
>
> All the best!
> Chris
>

Good luck. If you want to discuss, please do not hesitate to ask. We have
another paper pending along the same lines.

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-09 Thread Scott Atchley
Hi Chris,

We have looked at this _a_ _lot_ on Titan:

A Multi-faceted Approach to Job Placement for Improved Performance on
Extreme-Scale Systems

https://ieeexplore.ieee.org/document/7877165/

This issue we have is small jobs "inside" large jobs interfering with the
larger jobs. The item that is easy to implement with our scheduler was
"Dual-Ended Scheduling". We set a threshold of 16 nodes to demarcate small.
Jobs using more than 16 nodes, schedule from the top/front of the list and
smaller schedule from the bottom/back of the list.

Scott

On Sat, Jun 9, 2018 at 2:56 AM, Chris Samuel  wrote:

> On Saturday, 9 June 2018 12:39:02 AM AEST Bill Abbott wrote:
>
> > We set PriorityFavorSmall=NO and PriorityWeightJobSize to some
> > appropriately large value in slurm.conf, which helps.
>
> I guess that helps getting jobs going (and we use something similar), but
> my
> question was more about placement.   It's a hard one..
>
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] HPC Systems Engineer Positions

2018-06-01 Thread Scott Atchley
We have three HPC Systems Engineer positions open in the Technology
Integration group within the National Center for Computational Science at
ORNL. All are available from http://jobs.ornl.gov.

On Fri, Jun 1, 2018 at 9:20 AM, Mahmood Sayed 
wrote:

> Hello fellow HPC community.
>
> I have potentially 2 available positions for HPC Systems Engineers at Oak
> Ridge National Labs in the US.  These are not publicly advertised
> positions.  We (Attain) have knowledge of these through contacts at DoE.
> These are indefinitely funded positions.
>
> If you are in the area (or willing to move to the Oak Ridge, Tennessee
> area), please send me your info ASAP.
>
> Thanks!
>
> *Mahmood Sayed*
>
> Specialist, High Performance Computing, Federal Services
> [image: http://www.attain.com/sites/default/files/logo.png]
> 430 Davis Drive, Suite 270 | Morrisville, NC 27560
>
> Cell: 919.475.6110
>
> Email: *masa...@attain.com* 
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Heterogeneity in a tiny (two-system cluster)?

2018-02-16 Thread Scott Atchley
If it is memory-bandwidth limited, you may want to consider AMD's EPYC, which
has 33% more memory bandwidth (eight channels versus Skylake's six).

On Fri, Feb 16, 2018 at 3:41 AM, John Hearns via Beowulf <
beowulf@beowulf.org> wrote:

> Oh, and while you are at it.
> Do a bit of investigation on how the FVCOM model is optimised for use with
> AVX vectorisation.
> Hardware and clock speeds alone don't cut it.
>
>
> On 16 February 2018 at 09:39, John Hearns  wrote:
>
>> Ted,
>> I would go for the more modern system. You say yourself the first system
>> is two years old. In one or two years it will be out of warranty, and if a
>> component breaks you will have to decide to buy that component or just junk
>> the system.
>>
>>
>> Actually, having said that you should look at the FVCOM model and see how
>> well it scales on a multi-core system.
>> Intel are increasing core counts, but not clock speeds. Paradoxically, in
>> the past you used to be able to get dual-core parts at over 3 GHz, which
>> don't have many cores competing for bandwidth to RAM.
>> The counter-example to this is Skylake, which has more channels to RAM,
>> making for a more balanced system.
>>
>> I would go for a Skylake system, populate all the DIMM channels, and
>> quite honestly forget about running between two systems unless the size of
>> your models needs this.
>> Our latest Skylakes have 192 GBytes of RAM for that reason. In the last
>> generation this would sound like an unusual amount of RAM, but it makes
>> sense in the Skylake generation.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 15 February 2018 at 14:20, Tad Slawecki  wrote:
>>
>>>
>>> Hello, list -
>>>
>>> We are at a point where we'd like to explore a tiny cluster of two
>>> systems to speed up execution of the FVCOM circulation model. We already
>>> have a two-year-old  system with two 14-core CPUs (Xeon E-2680), and I have
>>> budget to purchase another system at this point, which we plan to directly
>>> connect via Infiniband. Should I buy an exact match, or go with the most my
>>> budget can handle (for example 2xXeon Gold 1630, 16-cores) under the
>>> assumption that the two-system cluster will operate at about the same speed
>>> *and* I can reap the benefits of the added performance when running smaller
>>> simulations independently?
>>>
>>> Our list owner already provided some thoughts:
>>>
>>> > I've always preferred homgenous clusters, but what you say is,
>>> > I think, quite plausible.  The issue you will have though is
>>> > ensuring that the application is built for the earliest of the
>>> > architectures so you don't end up using instructions for a newer
>>> > CPU on the older one (which would result in illegal instruction
>>> > crashes).
>>> >
>>> > But there may be other gotchas that others think of!
>>>
>>> Thank you ...
>>>
>>> Tad
>>> ___
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
>>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-18 Thread Scott Atchley
Some of the research has already made it into products and more is slated
for future products. As with all research, some did not pan out, but that
is to be expected.

On Sat, Nov 18, 2017 at 7:58 AM, C Bergström <cbergst...@pathscale.com>
wrote:

> Actually, the better question is, which vendor received funds and actually
> made a useful solution that can go production with the deliverables. From
> my view it seems like history is repeating itself[1] and I wish more people
> would wake up. The top down approach to funding scientific research and the
> in-fighting between labs is just too much nonsense. If these research
> projects were a start-up, it would have failed hard.
>
>  [1] https://en.wikipedia.org/wiki/X87
>
>
>
> On Sat, Nov 18, 2017 at 8:50 PM, Scott Atchley <e.scott.atch...@gmail.com>
> wrote:
>
>> Hmm, can you name a large processor vendor who has not accepted US
>> government research funding in the last five years? See DOE's FastForward,
>> FastForward2, DesignForward, DesignForward2, and now PathForward.
>>
>> On Fri, Nov 17, 2017 at 9:18 PM, Jonathan Engwall <
>> engwalljonathanther...@gmail.com> wrote:
>>
>>> Maybe they felt married to government sponsorship while the competition
>>> has found a way to compete with itself.
>>> http://www.nag.co.za/2017/10/26/amd-launches-ryzen-processor
>>> -with-radeon-vega-graphics-for-notebooks/
>>> Maybe such a huge contract even looks too good to be true.
>>>
>>> On Thu, Nov 16, 2017 at 3:06 AM, Mikhail Kuzminsky <k...@free.net> wrote:
>>>
>>>>
>>>> Unfortunately I did not find the english version, but Andreas
>>>>>
>>>>
>>>> Essentially yes Xeon Phi is not continued, but a new design called
>>>>> Xeon-H is coming.
>>>>>
>>>> Yes, and Xeon-H has close to KNL codename - Knights Cove. May be some
>>>> important (for HPC) microarchitecture features will remain.
>>>> But in any case stop of Xeon Phi give pluses for new NEC SX-Aurora.
>>>>
>>>> Mikhail Kuzminsky
>>>>
>>>> Zelinsky Institute
>>>> of Organic Chemistry
>>>> Moscow
>>>>
>>>> ___
>>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
>>>> Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>
>>>
>>>
>>> ___
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>>
>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-18 Thread Scott Atchley
Hmm, can you name a large processor vendor who has not accepted US
government research funding in the last five years? See DOE's FastForward,
FastForward2, DesignForward, DesignForward2, and now PathForward.

On Fri, Nov 17, 2017 at 9:18 PM, Jonathan Engwall <
engwalljonathanther...@gmail.com> wrote:

> Maybe they felt married to government sponsorship while the competition
> has found a way to compete with itself.
> http://www.nag.co.za/2017/10/26/amd-launches-ryzen-
> processor-with-radeon-vega-graphics-for-notebooks/
> Maybe such a huge contract even looks too good to be true.
>
> On Thu, Nov 16, 2017 at 3:06 AM, Mikhail Kuzminsky  wrote:
>
>>
>> Unfortunately I did not find the english version, but Andreas
>>>
>>
>> Essentially yes Xeon Phi is not continued, but a new design called Xeon-H
>>> is coming.
>>>
>> Yes, and Xeon-H has close to KNL codename - Knights Cove. May be some
>> important (for HPC) microarchitecture features will remain.
>> But in any case stop of Xeon Phi give pluses for new NEC SX-Aurora.
>>
>> Mikhail Kuzminsky
>>
>> Zelinsky Institute
>> of Organic Chemistry
>> Moscow
>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Scott Atchley
Are you logging something that goes to the local disk in the local case but
competes for network bandwidth when the OS is NFS mounted?
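Also, on the throttling theory: even without CPU temperatures, you can watch
the clocks directly. A minimal sketch, assuming the Linux cpufreq sysfs files
are exposed on these nodes (they may not be with every kernel/driver combination):

    # sample the current core frequencies every few seconds during a run
    import glob, time

    paths = sorted(glob.glob(
        "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"))
    while True:
        freqs = [int(open(p).read()) // 1000 for p in paths]   # kHz -> MHz
        print(min(freqs), "-", max(freqs), "MHz")
        time.sleep(5)

If the minimum drops steadily while LINPACK runs, something is throttling the
cores; if it stays flat, the slowdown is coming from somewhere else.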

On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley <e.scott.atch...@gmail.com>
wrote:

> Are you swapping?
>
> On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lath...@gmail.com> wrote:
>
>> ack, so maybe validate you can reproduce with another nfs root. Maybe a
>> lab setup where a single server is serving nfs root to the node. If you
>> could reproduce in that way then it would give some direction. Beyond that
>> it sounds like an interesting problem.
>>
>> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbis...@pppl.gov>
>> wrote:
>>
>>> Okay, based on the various responses I've gotten here and on other
>>> lists, I feel I need to clarify things:
>>>
>>> This problem only occurs when I'm running our NFSroot based version of
>>> the OS (CentOS 6). When I run the same OS installed on a local disk, I do
>>> not have this problem, using the same exact server(s).  For testing
>>> purposes, I'm using LINPACK, and running the same executable  with the same
>>> HPL.dat file in both instances.
>>>
>>> Because I'm testing the same hardware using different OSes, this
>>> (should) eliminate the problem being in the BIOS, and faulty hardware. This
>>> leads me to believe it's most likely a software configuration issue, like a
>>> kernel tuning parameter, or some other software configuration issue.
>>>
>>> These are Supermicro servers, and it seems they do not provide CPU
>>> temps. I do see a chassis temp, but not the temps of the individual CPUs.
>>> While I agree that should be the first thing I look at, it's not an option
>>> for me. Other tools like FLIR and Infrared thermometers aren't really an
>>> option for me, either.
>>>
>>> What software configuration, either a kernel parameter, configuration
>>> of numad or cpuspeed, or some other setting, could affect this?
>>>
>>> Prentice
>>>
>>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>>
>>>> Beowulfers,
>>>>
>>>> I need your assistance debugging a problem:
>>>>
>>>> I have a dozen servers that are all identical hardware: SuperMicro
>>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
>>>> 6, the users have been complaining of wildly inconsistent performance
>>>> across these 12 nodes. I ran LINPACK on these nodes, and was able to
>>>> duplicate the problem, with performance varying from ~14 GFLOPS to 64
>>>> GFLOPS.
>>>>
>>>> I've identified that performance on the slower nodes starts off fine,
>>>> and then slowly degrades throughout the LINPACK run. For example, on a node
>>>> with this problem, during first LINPACK test, I can see the performance
>>>> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
>>>> continues throughout the remaining tests. At the start of subsequent tests,
>>>> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GLOPS
>>>> at the end of the test.
>>>>
>>>> Because of the nature of this problem, I suspect this might be a
>>>> thermal issue. My guess is that the processor speed is being throttled to
>>>> prevent overheating on the "bad" nodes.
>>>>
>>>> But here's the thing: this wasn't a problem until we upgraded to CentOS
>>>> 6. Where I work, we use a read-only NFSroot filesystem for our cluster
>>>> nodes, so all nodes are mounting and using the same exact read-only image
>>>> of the operating system. This only happens with these SuperMicro nodes, and
>>>> only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
>>>> installed CentOS 6 on a local disk, the nodes worked fine.
>>>>
>>>> Any ideas where to look or what to tweak to fix this? Any idea why this
>>>> is only occuring with RHEL 6 w/ NFS root OS?
>>>>
>>>>
>>> ___
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
>>
>>
>> --
>> - Andrew "lathama" Latham lath...@gmail.com http://lathama.com
>> <http://lathama.org> -
>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Scott Atchley
Are you swapping?

On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham  wrote:

> ack, so maybe validate you can reproduce with another nfs root. Maybe a
> lab setup where a single server is serving nfs root to the node. If you
> could reproduce in that way then it would give some direction. Beyond that
> it sounds like an interesting problem.
>
> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal 
> wrote:
>
>> Okay, based on the various responses I've gotten here and on other lists,
>> I feel I need to clarify things:
>>
>> This problem only occurs when I'm running our NFSroot based version of
>> the OS (CentOS 6). When I run the same OS installed on a local disk, I do
>> not have this problem, using the same exact server(s).  For testing
>> purposes, I'm using LINPACK, and running the same executable  with the same
>> HPL.dat file in both instances.
>>
>> Because I'm testing the same hardware using different OSes, this (should)
>> eliminate the problem being in the BIOS, and faulty hardware. This leads me
>> to believe it's most likely a software configuration issue, like a kernel
>> tuning parameter, or some other software configuration issue.
>>
>> These are Supermicro servers, and it seems they do not provide CPU temps.
>> I do see a chassis temp, but not the temps of the individual CPUs. While I
>> agree that should be the first thing I look at, it's not an option for me.
>> Other tools like FLIR and Infrared thermometers aren't really an option for
>> me, either.
>>
>> What software configuration, either a kernel parameter, configuration
>> of numad or cpuspeed, or some other setting, could affect this?
>>
>> Prentice
>>
>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>
>>> Beowulfers,
>>>
>>> I need your assistance debugging a problem:
>>>
>>> I have a dozen servers that are all identical hardware: SuperMicro
>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
>>> 6, the users have been complaining of wildly inconsistent performance
>>> across these 12 nodes. I ran LINPACK on these nodes, and was able to
>>> duplicate the problem, with performance varying from ~14 GFLOPS to 64
>>> GFLOPS.
>>>
>>> I've identified that performance on the slower nodes starts off fine,
>>> and then slowly degrades throughout the LINPACK run. For example, on a node
>>> with this problem, during first LINPACK test, I can see the performance
>>> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
>>> continues throughout the remaining tests. At the start of subsequent tests,
>>> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GLOPS
>>> at the end of the test.
>>>
>>> Because of the nature of this problem, I suspect this might be a thermal
>>> issue. My guess is that the processor speed is being throttled to prevent
>>> overheating on the "bad" nodes.
>>>
>>> But here's the thing: this wasn't a problem until we upgraded to CentOS
>>> 6. Where I work, we use a read-only NFSroot filesystem for our cluster
>>> nodes, so all nodes are mounting and using the same exact read-only image
>>> of the operating system. This only happens with these SuperMicro nodes, and
>>> only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
>>> installed CentOS 6 on a local disk, the nodes worked fine.
>>>
>>> Any ideas where to look or what to tweak to fix this? Any idea why this
>>> is only occuring with RHEL 6 w/ NFS root OS?
>>>
>>>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
>
> --
> - Andrew "lathama" Latham lath...@gmail.com http://lathama.com
>  -
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Poor bandwith from one compute node

2017-08-17 Thread Scott Atchley
I would agree that the bandwidth points at 1 GigE in this case.

For IB/OPA cards running slower than expected, I would recommend ensuring
that they are using the correct amount of PCIe lanes.

On Thu, Aug 17, 2017 at 12:35 PM, Joe Landman  wrote:

>
>
> On 08/17/2017 12:00 PM, Faraz Hussain wrote:
>
>> I noticed an mpi job was taking 5X longer to run whenever it got the
>> compute node lusytp104 . So I ran qperf and found the bandwidth between it
>> and any other nodes was ~100MB/sec. This is much lower than ~1GB/sec
>> between all the other nodes. Any tips on how to debug further? I haven't
>> tried rebooting since it is currently running a single-node job.
>>
>> [hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
>> tcp_lat:
>> latency  =  17.4 us
>> tcp_bw:
>> bw  =  118 MB/sec
>> [hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
>> tcp_lat:
>> latency  =  20.4 us
>> tcp_bw:
>> bw  =  1.07 GB/sec
>>
>> This is separate issue from my previous post about a slow compute node. I
>> am still investigating that per the helpful replies. Will post an update
>> about that once I find the root cause!
>>
>
> Sounds very much like it is running over gigabit ethernet vs Infiniband.
> Check to make sure it is using the right network ...
>
>
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
> --
> Joe Landman
> e: joe.land...@gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Hyperthreading and 'OS jitter'

2017-07-22 Thread Scott Atchley
I would imagine the answer is "It depends". If the application uses the
per-CPU caches effectively, then performance may drop when HT shares the
cache between the two processes.

We are looking at reserving a couple of cores per node on Summit to run
system daemons if the user requests it. If the user can effectively use the
GPUs, the CPUs should be idle much of the time anyway. We will see.

I like your idea of a low-power core to run OS tasks.
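For the odd/even split, you do not necessarily need a full cgroup setup to try
it. A minimal sketch, assuming Linux and that the odd-numbered logical CPUs are
the hardware threads you want to hand to user code ("./my_benchmark" is a
placeholder for whatever you want to test):

    # run a benchmark confined to the odd-numbered logical CPUs, leaving
    # the even-numbered ones free for the OS (Linux only)
    import os, subprocess

    odd_cpus = {c for c in range(os.cpu_count()) if c % 2}
    os.sched_setaffinity(0, odd_cpus)      # pins this process and its children
    subprocess.run(["./my_benchmark"])     # placeholder benchmark binary

Comparing that against an unconstrained run (and against HT off in the BIOS)
should tell you quickly whether the idea is at least performance neutral.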

On Sat, Jul 22, 2017 at 6:11 AM, John Hearns via Beowulf <
beowulf@beowulf.org> wrote:

> Several times in the past I have jokingly asked if there should be another
> lower-powered CPU core in a system to run OS tasks (hello Intel - are you
> listening?)
> Also in the past there was advice, to get the best possible throughput on
> AMD Bulldozer CPUs, to run only on every second core (as they share FPUs).
> When I managed a large NUMA system we used cpusets, and the OS ran in a
> small 'boot cpuset' which was physically near the OS disks and IO cards.
>
> I had a thought about hyperthreading though. A few months ago we did a
> quick study with Blender rendering, and got 30% more throughput with HT
> switched on. Also someone who I am working with now would like to assess
> the effect of HT on/HT off on their codes.
> I know that HT has normally not had any advantages with HPC-type codes - as
> the core should be 100% flat out.
>
> I am thinking though - what would be the effect of enabling HT, and using a
> cgroup to constrain user codes to run on all the odd-numbered CPU cores,
> with the OS tasks on the even-numbered ones?
> I would hope this would be at least performance neutral? Your thoughts
> please! Also thoughts on candidate benchmark programs to test this idea.
>
>
> John Hearns
>  ... John Hearns
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Register article on Epyc

2017-06-22 Thread Scott Atchley
Hi Mark,

I agree that these are slightly noticeable but they are far less than
accessing a NIC on the "wrong" socket, etc.

Scott

On Thu, Jun 22, 2017 at 9:26 AM, Mark Hahn  wrote:

> But now, with 20+ core CPUs, does it still really make sense to have
>> dual socket systems everywhere, with NUMA effects all over the place
>> that typical users are blissfully unaware of?
>>
>
> I claim single-socket systems already have NUMA effects, since multiple
> layers (differently-shared) of cache have the same effect as memory at
> different hop-distances.
>
> regards, mark hahn.
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Register article on Epyc

2017-06-21 Thread Scott Atchley
In addition to storage, if you use GPUs for compute, the single socket is
compelling. If you rely on the GPUs for the parallel processing, then the
CPUs are just for serial acceleration and handling I/O. A single socket
with 32 cores and 128 lanes of PCIe can handle up to eight GPUs with four
CPU cores per GPU. This would be a very dense solution and could be
attractive for data centers as well as HPC.
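The arithmetic is straightforward (a rough budget only, ignoring NICs and
NVMe, which would need lanes taken back from the GPUs or a PCIe switch):

  8 GPUs x 16 PCIe lanes = 128 lanes  (the full single-socket allocation)
  32 cores / 8 GPUs      =   4 cores per GPU for serial work and I/O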

On Wed, Jun 21, 2017 at 12:39 PM, Kilian Cavalotti <
kilian.cavalotti.w...@gmail.com> wrote:

> On Wed, Jun 21, 2017 at 5:39 AM, John Hearns 
> wrote:
> > For a long time the 'sweet spot' for HPC has been the dual socket Xeons.
>
> True, but why? I guess because there wasn't many other options, and in
> the first days of multicore CPUs, it was the only way to have decent
> local parallelism, even with QPI (and its ancestors) being a
> bottleneck. And also to have enough PCIe lanes (40 lanes ought to
> enough for anyone, right?)
>
> But now, with 20+ core CPUs, does it still really make sense to have
> dual socket systems everywhere, with NUMA effects all over the place
> that typical users are blissfully unaware of?
>
> Seems to me like this is a smart design move from AMD, and that
> single-socket systems, with 20+ core CPUs and 128 PCIe lanes could
> make a very cool base for many HPC systems. Of course, that's just on
> paper for now, proper benchmarking will be required.
>
> Cheers,
> --
> Kilian
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Register article on Epyc

2017-06-21 Thread Scott Atchley
The single socket versions make sense for storage boxes that can use RDMA.
You can have two EDR ports out the front using 16 lanes each. For the
storage, you can have 32-64 lanes internally or out the back for NVMe. You
even have enough lanes for two ports of HDR, when it is ready, and 48-64
lanes for the storage.

On Wed, Jun 21, 2017 at 8:39 AM, John Hearns  wrote:

> https://www.theregister.co.uk/2017/06/20/amd_epyc_launch/
>
> Interesting to see that these are being promoted as single socket systems.
> For a long time the 'sweet spot' for HPC has been the dual socket Xeons.
> I would speculate about single socket AMD systems, with a smaller form
> factor motherboard, maybe with onboard Infiniband.  Put a lot of these
> cards in a chassis and boot them disklessly and you get a good amount of
> compute power.
>
> Also regarding compute power, it would be interesting to see a comparison
> of a single socket of these versus Xeon Phi rather than -v4 or -v5 Xeon.
>
> The encrypted RAM modes are interesting, however I can't see any use case
> for HPC.
> Unless you are running a cloudy cluster where your customers are VERY
> concerned about security.  Of course there are such customers!
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Suggestions to what DFS to use

2017-02-15 Thread Scott Atchley
Hi Chris,

Check with me in about a year.

After using Lustre for over 10 years to initially serve ~10 PB of disk and
now serve 30+ PB with very nice DDN gear, later this year we will be
installing 320 PB (250 PB useable) of GPFS (via IBM ESS storage units) to
support Summit, our next gen HPC system from IBM with Power9 CPUs and
NVIDIA Volta GPUs. Our current Lustre system is capable of 1 TB/s for large
sequential writes, but random write performance is much lower (~400 GB/s or
40% of sequential). The target performance for GPFS will be 2.5 TB/s
sequential writes and 2.2 TB/s random (~90% of sequential). The initial
targets are slightly lower, but we are supposed to achieve these rates by
2019.

We are very familiar with Lustre, the good and the bad, and ORNL is the
largest contributor to the Lustre codebase outside of Intel. We have
encountered many bugs at our scale that few other sites can match and we
have tested patches for Intel before their release to see how they perform
at scale. We have been testing GPFS for the last three years in preparation
for the change and IBM has been a very good partner to understand our
performance and scale issues. Improvements that IBM are adding to support
the CORAL systems will also benefit the larger community.

People are attracted to the "free" aspect of Lustre (in addition to the
open source), but it is not truly free. For both of our large Lustre
systems, we bought block storage from DDN and we added Lustre on top. We
have support contracts with DDN for the hardware and Intel for Lustre as
well as a large team within our operations to manage Lustre and a full-time
Lustre developer. The initial price is lower, but at this scale running
without support contracts and an experienced operations team is untenable.
IBM is proud of GPFS and their ESS hardware (i.e. licenses and hardware are
expensive) and they also require support contracts, but the requirements
for operations staff is lower. It is probably more expensive than any other
combination of hardware/licenses/support, but we have one vendor to blame,
which our management sees as a value.

As I said, check back in a year or two to see how this experiment works out.

Scott

On Wed, Feb 15, 2017 at 1:53 AM, Christopher Samuel 
wrote:

> Hi John,
>
> On 15/02/17 17:33, John Hanks wrote:
>
> > So "clusters" is a strong word, we have a collection of ~22,000 cores of
> > assorted systems, basically if someone leaves a laptop laying around
> > unprotected we might try to run a job on it. And being bioinformatic-y,
> > our problem with this and all storage is metadata related. The original
> > procurement did not include dedicated NSD servers (or extra GPFS server
> > licenses) so we run solely off the SFA12K's.
>
> Ah right, so these are the embedded GPFS systems from DDN. Interesting
> as our SFA10K's hit EOL in 2019 and so (if our funding continues beyond
> 2018) we'll need to replace them.
>
> > Could we improve with dedicated NSD frontends and GPFS clients? Yes,
> > most certainly. But again, we can stand up a PB or more of brand new
> > SuperMicro storage fronted by BeeGFS  that performs as well or better
> > for around the same cost, if not less.
>
> Very nice - and for what you're doing it sounds like just what you need.
>
> > I don't have enough of an
> > emotional investment in GPFS or DDN to convince myself that suggesting
> > further tuning that requires money and time is worthwhile for our
> > environment. It more or less serves the purpose it was bought for, we
> > learn from the experience and move on down the road.
>
> I guess I'm getting my head around how other sites GPFS performs given I
> have a current sample size of 1 and that was spec'd out by IBM as part
> of a large overarching contract. :-)
>
> I guess I assuming that because that was what we had it was how most
> sites did it, apologies for that!
>
> All the best,
> Chris
> --
>  Christopher SamuelSenior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] genz interconnect?

2016-10-12 Thread Scott Atchley
The Gen-Z site looks like it has a detailed FAQ. The CCIX FAQ is a little
more sparse. The ARM link you posted is a good overview.

On Wed, Oct 12, 2016 at 8:11 AM, Michael Di Domenico  wrote:

> anyone have any info on this?  there isn't much out there on the web.
> the arm.com link has more detail then the actual website.
>
> http://genzconsortium.org/news-type/press-release/
> https://community.arm.com/groups/processors/blog/2016/
> 10/11/how-do-amba-ccix-and-genz-address-the-needs-of-data-center
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] AMD cards with integrated SSD slots

2016-07-27 Thread Scott Atchley
None have AMD CPUs? Number three Titan has AMD Interlagos CPUs and NVIDIA
GPUs.

Given that the Fiji can access HBM at 512 GB/s, accessing NVM at 4 GB/s
will feel rather slow albeit much better than 1-2 GB/s connected to the
motherboard's PCIe.

On Wed, Jul 27, 2016 at 5:53 PM, Brian Oborn  wrote:

> I noticed that AMD is coming out with a new line of Pro cards that have
> two PCIe 3.0 M.2 slots for caching 1TB of content on the card.
>
>
> http://www.anandtech.com/show/10518/amd-announces-radeon-pro-ssg-fiji-with-m2-ssds-onboard
>
> I'm wondering what the list's thoughts are on the following questions:
>
> 1) Is AMD relevant in HPC anymore? In the Top500 list I only found two
> older systems that had AMD video cards and none that had AMD CPUs. Does
> anyone here run smaller newer AMD clusters?
>
> 2) Are there many workloads that would benefit from this type of cache? I
> think trying to juggle GPU RAM, system RAM, GPU storage, system storage,
> and inter-node networking might make this too difficult to scale beyond a
> single system?
>
> Thanks,
>
> Brian Oborn
>
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] NFS HPC survey results.

2016-07-22 Thread Scott Atchley
Did you mean IB over Ethernet (IBoE)? I thought IB over IP has been around
long before RoCE.

On Thu, Jul 21, 2016 at 7:34 PM, Christopher Samuel 
wrote:

> Thanks so much Bill, very much appreciated!
>
> On 21/07/16 09:19, Bill Broadley wrote:
>
> > 15) If IB what transport (10 responses)
> > 100% IPoIB
> >0% Other
>
> I learnt of one group yesterday who have recently started using NFS over
> RDMA over "IB over IP" (what used to be called RoCE) on 100gigE Mellanox
> gear.  It's too early to know how stable it is but they do report much
> improved performance, especially with latency.
>
> All the best,
> Chris
> --
>  Christopher SamuelSenior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


[Beowulf] Anyone using Apache Mesos?

2015-11-11 Thread Scott Atchley
Someone asked me and I said I would ask around.

Thanks,

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Semour Cray 90th Anniversary

2015-10-14 Thread Scott Atchley
On Wed, Oct 14, 2015 at 3:58 PM, Prentice Bisbal <
prentice.bis...@rutgers.edu> wrote:

> On 10/03/2015 01:54 PM, Nathan Pimental wrote:
>
> Very nice article. Are Cray computers still made, and how popular are
> they? How pricey are they? :)
>
>
> Yes, Argonne National Lab (ANL) announced in April it will purchase a
> large Cray system as part of the CORAL initiative at a price of $200
> million. Expected performance is 180 PFLOPs.
>
> https://www.alcf.anl.gov/articles/introducing-aurora
>
> Interesting, ANL has long been an IBM shop (Intrepid, Mira) and ORNL has
> been a Cray shop (Jaguar, Titan)  but that's switching with the CORAL
> purchases. I guess they want to keep their admins and developers on their
> toes.
>

Clearly, the admins would have kept things the same. :-)

What did not change for either lab is the hardware model. ORNL's Titan is a
heterogeneous system with CPUs and GPUs while ANL's Mira is a homogeneous
system. ORNL's Summit will have CPUs and GPUs and ANL's Aurora will have a
homogeneous, many-core processor.
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Accelio

2015-08-20 Thread Scott Atchley
They are using this as a basis for the XioMessenger within Ceph to get RDMA
support.

On Thu, Aug 20, 2015 at 9:24 AM, John Hearns john.hea...@xma.co.uk wrote:

 I saw this mentioned on the Mellanox site. Has anyone come across it:



 http://www.accelio.org/



 Looks interesting.








 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf


___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


[Beowulf] China aims for 100 PF

2015-07-17 Thread Scott Atchley
They will use a homegrown GPDSP (general purpose DSP) accelerator in lieu
of the Intel Knights Landing accelerators:

http://www.theplatform.net/2015/07/15/china-intercepts-u-s-intel-restrictions-with-homegrown-supercomputer-chips/

Also, hints about their interconnect and file system upgrades.

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Paper describing Google's queuing system Borg

2015-04-21 Thread Scott Atchley
Is Omega the successor? The Borg paper mentions Omega:

  Omega [69] supports multiple parallel, specialized “verticals” that are
each roughly equivalent to a Borgmaster minus its persistent store and link
shards. Omega schedulers use optimistic concurrency control to manipulate a
shared representation of desired and observed cell state stored in a central
persistent store, which is synced to/from the Borglets by a separate link
component. The Omega architecture was designed to support multiple distinct
workloads that have their own application-specific RPC interface, state
machines, and scheduling policies (e.g., long-running servers, batch jobs
from various frameworks, infrastructure services like cluster storage
systems, virtual machines from the Google Cloud Platform). On the other
hand, Borg offers a “one size fits all” RPC interface, state machine
semantics, and scheduler policy, which have grown in size and complexity
over time as a result of needing to support many disparate workloads, and
scalability has not yet been a problem (§3.4).



On Thu, Apr 16, 2015 at 12:20 PM, Deepak Singh mnd...@gmail.com wrote:

 Great to see something about the Borg design out there.  Google have also
 written about the successor to Borg, a framework called Omega.

 http://research.google.com/pubs/pub41684.html

 PDF: http://research.google.com/pubs/archive/41684.pdf


 On Thu, Apr 16, 2015 at 6:53 AM Chris Samuel sam...@unimelb.edu.au
 wrote:

 Hi all,

 This is a very recent (2015) paper describing the queuing system used by
 Google internally, called Borg.

 http://research.google.com/pubs/pub43438.html

 Full PDF available from there.

 Thought it might interest some folks!

 All the best,
 Chris
 --
  Christopher SamuelSenior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/  http://twitter.com/vlsci

 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf


 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf


___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] interesting article on HPC vs evolution of 'big data' analysis

2015-04-09 Thread Scott Atchley
On Wed, Apr 8, 2015 at 9:56 PM, Greg Lindahl lind...@pbm.com wrote:

 On Wed, Apr 08, 2015 at 03:57:34PM -0400, Scott Atchley wrote:

  There is concern by some and outright declaration by others (including
  hardware vendors) that MPI will not scale to exascale due to issues like
  rank state growing too large for 10-100 million endpoints,

 That's weird, given that it's an implementation choice.


It is one of the concerns raised, but not the only one. No one is giving up
on MPI; that is not an alternative given the existing code base. There are
efforts to avoid duplication of rank information within a node (no need for
each rank to have duplicates) or to use a single MPI rank per node and use
OpenMP/other to manage node-local parallelism at the risk of a large
many-core node's cores all trying to access the NIC at the same time.

I am not advocating for/against MPI or predicting its imminent demise, but
I am aware of the concerns by the vendors.



 Presumably Intel is keeping the PathScale tiny rank state as a
 feature?


One would expect, but that is probably necessary but not sufficient for
their many-core future.



 Reliability, now that's a serious issue! And not one that's trivially
 fixed for any problem that must be tightly-coupled.


Yes, and we are open to suggestions. ;-)
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] CephFS

2015-04-09 Thread Scott Atchley
No, but you might find this interesting:

http://dl.acm.org/citation.cfm?id=2538562

On Thu, Apr 9, 2015 at 11:24 AM, Tom Harvill u...@harvill.net wrote:


 Hello,

 Question: is anyone on this list using CephFS in 'production'?  If so,
 what are you using
 it for (ie. scratch/tmp, archive, homedirs)?  In our setup we use NFS
 shared ZFS for /home,
 Lustre for /work (performance-oriented shared fs), and job-specific tmp on
 the worker nodes
 local disk.

 What I really want to know is if anyone is using CephFS (without
 headache?) in a production
 HPC cluster in place of where one might use Lustre?

 Thanks!
 Tom
 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] interesting article on HPC vs evolution of 'big data' analysis

2015-04-08 Thread Scott Atchley
There is concern by some and outright declaration by others (including
hardware vendors) that MPI will not scale to exascale due to issues like
rank state growing too large for 10-100 million endpoints, lack of
reliability, etc. Those that make this claim then offer up their favorite
solution (a PGAS variant, Chapel, Legion, Open Community Runtime). Several
assert that the event-driven/task-driven runtimes will take care of data
partitioning, data movement, etc. and that the user only has to define
relationships and dependencies while exposing as much parallelism as
possible.

The domain scientists shudder at the thought of rewriting existing codes,
some of which have existed for decades. If they do get funding to rewrite,
which new programming model should they pick? At this point, there is no
clear favorite.



On Wed, Apr 8, 2015 at 2:31 PM, Prentice Bisbal prentice.bis...@rutgers.edu
 wrote:

 I got annoyed by this article and had to stop reading it. I'll go back
 later and try to give it a proper critique, but obviously disagree with
 most of what I've read so far. Right of the bat, the author implies that
 Big Data = HPC, and I disagree with that.

 More ranting to come

 Prentice


 On 04/08/2015 01:16 PM, H. Vidal, Jr. wrote:

 Curious as to what the body of thought is here on this article:

 http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/

 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf


 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Mellanox Multi-host

2015-03-11 Thread Scott Atchley
Looking at this and the above link:

http://www.mellanox.com/page/press_release_item?id=1501

It seems that the OCP Yosemite is a motherboard that allows four compute
cards to be plugged into it. The compute cards can even have different CPUs
(x86, ARM, Power). The Yosemite board has the NIC and connection to the
switch. It is not clear if the multi-host connection is tunneled over the
PCIe connection between the compute card and the Yosemite board or if
network communication is handled over the compute card's NIC to the
aggregator on the Yosemite board. I expect it is tunneled over PCIe, but more
details would be nice.

It seems the whole OCP Yosemite project is geared towards avoiding NUMA and
using cheaper, simpler CPUs.

On Wed, Mar 11, 2015 at 8:51 AM, John Hearns hear...@googlemail.com wrote:

 Talking about 10Gbps networking... and above:


 http://www.theregister.co.uk/2015/03/11/mellanox_adds_networking_specs_to_ocp/

 In the configuration Mellanox demonstrated, a 648-node cluster would only
 need 162 each of NICs, ports and cables.

 So looks like one switch port can fan out to four hosts,
 and they talk about mixing FPGA and GPU
 Might make for a very interesting cluster.

 ___
 Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
 To change your subscription (digest mode or unsubscribe) visit
 http://www.beowulf.org/mailman/listinfo/beowulf


___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


[Beowulf] Summit

2014-11-14 Thread Scott Atchley
This is what's next:

https://www.olcf.ornl.gov/summit/

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] 10Gb/s iperf test point (TCP) available ?

2010-10-15 Thread Scott Atchley
On Oct 14, 2010, at 10:37 PM, Christopher Samuel wrote:

 Apologies if this is off topic, but I'm trying to check
 what speeds the login nodes to our cluster and BlueGene
 can talk at and the only 10Gb/s iperf server I've been
 given access to so far (run by AARNET) showed me just
 under 1Gb/s.

Have you tried netperf? I have read that iperf calls gettimeofday() before and 
after each read/write which might mean you are measuring the BG syscall time 
more than network throughput time.
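For reference, a minimal netperf run looks something like this (host names
are placeholders):

  remote$ netserver                          # start the listener on the far end
  local$  netperf -H remote-host -t TCP_STREAM -l 30
  # reports throughput in 10^6 bits/sec over a 30 second run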

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] 48-port 10gig switches?

2010-09-02 Thread Scott Atchley
On Sep 2, 2010, at 12:58 PM, David Mathog wrote:

 A lot of 1 GbE switches use around 15W/port so I thought 10 GbE switches
 would be real fire breathers.  It doesn't look that way though, the
 power consumption cited here:
 
 http://www.voltaire.com/NewsAndEvents/Press_Releases/press2010/Voltaire_Announces_High_Density_10_GbE_Switch_for_Efficient_Scaling_of_Cloud_Networks
 
 is the industry’s lowest power consumption of 6.3 watts/port or 302W.
 I wonder what tricks they used to increase the speed to 10 GbE and drop
 the power consumption.

Not using 10GBase-T? It uses SFP+ which do not use much power.

Scott

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] OT: recoverable optical media archive format?

2010-06-10 Thread Scott Atchley
On Jun 10, 2010, at 3:20 PM, David Mathog wrote:

 Jesse Becker and others suggested:
 
http://users.softlab.ntua.gr/~ttsiod/rsbep.html
 
 I tried it and it works, mostly, but definitely has some warts.
 
 To start with I gave it a negative control - a file so badly corrupted
 it should NOT have been able to recover it.
 
 % ssh remotePC 'dd if=/dev/sda1 bs=8192' > img.orig
 % cat img.orig | bzip2 > img.bz2.orig
 % cat img.bz2.orig | rsbep > img.bz2.rsbep
 % cat img.bz2.rsbep | pockmark -maxgap 10 -maxrun 1 > img.bz2.rsbep.pox
 % cat img.bz2.rsbep.pox | rsbep -d -v > img.bz2.restored
 rsbep: number of corrected failures   : 9725096
 rsbep: number of uncorrectable blocks : 0
 
 img.orig is a Windows XP partition with all empty space filled with
 0x0 bytes.  That is then compressed with bzip2, then run
 through rsbep (the one from the link above), then corrupted
 with pockmark.  Pockmark is my own little concoction, when used as
 shown  it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for
 a run of (1-MAXRUN).  In both cases the gap and run length are chosen at
 random from those ranges for each new gap/run.
 This should corrupt around 10% of the file, which I assumed would render
 it unrecoverable.  Notice in the file sizes below that the overall size
 did not change when the file was run through pockmark.  rsbep did not
 note any errors it couldn't correct. However, the
 size of the restored file is not the same as the orig.
 
 4056976560 2010-06-08 17:51 img.bz2.restored
 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox
 4639143600 2010-06-08 16:13 img.bz2.rsbep
 4056879025 2010-06-08 14:40 img.bz2.orig
 20974431744 2010-06-07 15:23 img.orig
 
 % bunzip2 -tvv img.bz2.restored
  img.bz2.restored: 
[1: huff+mtf data integrity (CRC) error in data
 
 So at the very least rsbep sometimes says it has recovered a file when
 it has not.  I didn't really expect it to rescue this particular input,
 but it really should have handled it better.

I have never used this tool, but I would wonder if your pockmark tool damaged 
the rsbep metadata, specifically one or more of the metadata segment lengths. 
Bear in mind that corruption of the metadata is not beyond the realm of 
possibility, but I assume that the rsbep metadata is not replicated or 
otherwise protected.

 I reran it with a less damaged file like this:
 
 % cat img.bz2.rsbep | pockmark -maxgap 100 -maxrun 1 > img.bz2.rsbep.pox2
 % cat img.bz2.rsbep.pox2 | rsbep -d -v > img.bz2.restored2
 rsbep: number of corrected failures   : 46025036
 rsbep: number of uncorrectable blocks : 0
 % bunzip2 img.bz2.restored2
 bunzip2: Can't guess original name for img.bz2.restored2 -- using
 img.bz2.restored2.out
 bunzip2: img.bz2.restored2: trailing garbage after EOF ignored
 % md5sum img.bz2.restored2.out img.orig
 7fbaec7143c3a17a31295a803641aa3c  img.bz2.restored2.out
 7fbaec7143c3a17a31295a803641aa3c  img.orig
 
 This time it was able to recover the corrupted file, but again, it
 created an output file which was a different size.  Is this always the
 case?   Seems to be at least for the size file used here:
 
 % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2
 
 nopox.bz2 is also 4056976560.   The decoded output is always 97535 bytes
 larger than the original, which may bear some relation to the
 z=ERR_BURST_LEN parameter as:
 
 97535 /765 = 127.496732
 
 which is suspiciously close to 255/2.  Or that could just be a coincidence.
 
 In any case, bunzip2 was able to handle the crud on the end, but this
 would have been a problem for other binary files.

This is most likely a requirement of the underlying Reed-Solomon library that 
requires equal length blocksizes. If your original file is N bytes and N % M != 
0 where M is the blocksize, I imagine it pads the last block with 0s so that it 
is N bytes. It should not affect bunzip since the length is encoded in the file 
and it ignores anything tacked onto the end.

A quick glance at his website suggests that the length should be the same. He 
only shows, however, the md5sums and not the ls -l output.
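One way to test the padding theory with the files above would be to trim the
restored copy back to the original length before comparing checksums (GNU
coreutils assumed):

 % truncate -s $(stat -c %s img.bz2.orig) img.bz2.restored
 % md5sum img.bz2.orig img.bz2.restored
 # matching sums would mean only trailing pad bytes differed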

Scott

 The other thing that is frankly bizarre is the number of corrected
 failures for the 2nd case vs. the first.The 2nd should have 10X
 fewer bad bytes than the first, but the rsbep status messages
 indicate 4.73X MORE.  However, the number of bad bytes in the 2nd is
 almost exactly 1%, as it should be.  All of this suggests that rsbep
 does not handle correctly files which are too corrupted.  It gives the
 wrong number of corrected blocks and thinks that it has corrected
 everything when it has not done so.  Worse, even when it does work the
 output file was never (in any of the test cases) the same size as the
 input file.
 
 I think this program has potential but it needs a bit of work to sand
 the rough edges off.  I will have a look at it, but won't have a chance
 to do so for a couple of weeks.
 
 Regards,
 
 David Mathog
 mat...@caltech.edu
 Manager, Sequence Analysis Facility, 

Re: [Beowulf] Q: IB message rate large core counts (per node)?

2010-02-24 Thread Scott Atchley
On Feb 23, 2010, at 6:16 PM, Brice Goglin wrote:

 Greg Lindahl wrote:
 now that I'm inventorying ignorance, I don't really understand why RDMA 
 always seems to be presented as a big hardware issue.  wouldn't it be 
 pretty easy to define an eth or IP-level protocol to do remote puts,
 gets, even test-and-set or reduce primitives, where the interrupt handler
 could twiddle registered blobs of user memory on the target side?
 
 
 That approach is called Active Messages, and can be bolted on to
 pretty much every messaging implementation. Doesn't OpenMX provide
 that kind of interface?
 
 
 Open-MX offers what MX offers: no explicit RDMA interface, only 2-sided.
 But something similar to a remote get is used internally for large
 messages. It wouldn't be hard to mplement some RDMA-like features in
 such a software-only model like Mark said above.
 
 Brice

Don't forget the unexpected handler which can provide some Active Message 
behavior.

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] which mpi library should I focus on?

2010-02-23 Thread Scott Atchley

On Feb 20, 2010, at 1:49 PM, Paul Johnson wrote:


What are the reasons to prefer one or the other?


Why choose? You can install both and test with your application to see  
if there is a performance difference (be sure to keep your runtime  
environment paths correct - don't mix libraries and MPI binaries).  
Your MPI code should adhere to the standard and both should run it  
correctly.
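For example, with two hypothetical install prefixes, something like:

  $ /opt/mpiA/bin/mpicc -o app.mpiA app.c
  $ /opt/mpiA/bin/mpirun -np 16 ./app.mpiA
  $ /opt/mpiB/bin/mpicc -o app.mpiB app.c
  $ /opt/mpiB/bin/mpirun -np 16 ./app.mpiB
  # build and launch each binary with its own wrappers and launcher so the
  # libraries never mix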


Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Performance tuning for Jumbo Frames

2009-12-15 Thread Scott Atchley

On Dec 14, 2009, at 12:57 PM, Alex Chekholko wrote:


Set it as high as you can; there is no downside except ensuring all
your devices are set to handle that large unit size.  Typically, if  
the

device doesn't support jumbo frames, it just drops the jumbo frames
silently, which can result in odd intermittent problems.


You can test it by using the size parameter with ping:

$ ping -s size_in_bytes host

If they all drop, then you have exceeded the MTU of some device.
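For example, to verify a 9000-byte MTU end to end, subtract the 20-byte IP
and 8-byte ICMP headers and forbid fragmentation (Linux iputils syntax):

  $ ping -M do -s 8972 host
  # replies mean jumbo frames pass; errors or silence point at a device
  # with a smaller MTU somewhere in the path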

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Re: scalability

2009-12-10 Thread Scott Atchley

On Dec 10, 2009, at 9:56 AM, Jörg Saßmannshausen wrote:


I have heard of Open-MX before, do you need special hardware for that?


No, any Ethernet driver on Linux.

http://open-mx.org

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] mpd ..failed ..!

2009-11-16 Thread Scott Atchley

On Nov 14, 2009, at 7:24 AM, Zain elabedin hammade wrote:


I installed mpich2 - 1.1.1-1.fc11.i586.rpm .


You should ask this on the mpich list at:

https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


I wrote on every machine :

mpd 
mpdtrace -l


You started stand-alone MPD rings of size one on each host. This is  
incorrect. You should use mpdboot and a machine file.


$ mpdboot -f machinefile -n num_hosts ...
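For example, with four hosts (names are placeholders):

  $ cat machinefile
  node01
  node02
  node03
  node04
  $ mpdboot -n 4 -f machinefile    # one ring spanning the hosts
  $ mpdtrace -l                    # every host should appear exactly once
  $ mpiexec -n 8 ./a.out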

Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] large scratch space on cluster

2009-09-29 Thread Scott Atchley

On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote:

However, I was wondering whether it does make any sense to somehow  
'export'
that scratch space to other nodes (4 cores only). So, the idea  
behind that
is, if I need a vast amount of scratch space, I could use the one in  
the 8
core node (the one I mentioned above). I could do that with nfs but  
I got the
feeling it will be too slow. Also, I only got GB ethernet at hand,  
so I
cannot use some other networks here. Is there a good way of doing  
that? Some
words like i-scsi and cluster-FS come to mind but to be honest, up  
to now I

never really worked with them.

Any ideas?

All the best

Jörg


I am under the impression that NFS can saturate a gigabit link.
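A rough check (paths are examples) is a streaming write through the mount:

  $ dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096 conv=fdatasync
  # ~110-118 MB/s would mean the GigE link, not NFS, is the bottleneck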

If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org 
) over Open-MX (http://www.open-mx.org).


Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] large scratch space on cluster

2009-09-29 Thread Scott Atchley

On Sep 29, 2009, at 1:13 PM, Scott Atchley wrote:


On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote:

However, I was wondering whether it does make any sense to somehow  
'export'
that scratch space to other nodes (4 cores only). So, the idea  
behind that
is, if I need a vast amount of scratch space, I could use the one  
in the 8
core node (the one I mentioned above). I could do that with nfs but  
I got the
feeling it will be too slow. Also, I only got GB ethernet at hand,  
so I
cannot use some other networks here. Is there a good way of doing  
that? Some
words like i-scsi and cluster-FS come to mind but to be honest, up  
to now I

never really worked with them.

Any ideas?

All the best

Jörg


I am under the impression that NFS can saturate a gigabit link.

If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org 
) over Open-MX (http://www.open-mx.org).


I should add that PVFS2 is meant to separate the metadata from IO and
have multiple IO servers. You can run it on a single server with both
metadata and IO, but it may not be much different than NFS.


Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Re: typical latencies for gigabit ethernet

2009-06-29 Thread Scott Atchley

On Jun 29, 2009, at 12:10 PM, Dave Love wrote:


When I test Open-MX, I turn interrupt coalescing off. I run
omx_pingpong to determine the lowest latency (LL). If the NIC's  
driver

allows one to specify the interrupt value, I set it to LL-1.


Right, and that's what I did before, with sensible results I thought.
Repeating it now on Centos 5.2 and OpenSuSE 10.3, it doesn't behave
sensibly, and I don't know what's different from the previous SuSE
results apart, probably, from the minor kernel version.  If I set
rx-frames=0, I see this:

rx-usec    latency (µs)
  20          34.6
  12          26.3
   6          20.0
   1          14.8

whereas if I just set rx-frames=1, I get 14.7 µs, roughly  
independently

of rx-usec.  (Those figures are probably ±∼0.2µs.)


That is odd. I have only tested with Intel e1000 and our myri10ge  
Ethernet driver. The Intel driver does not let you specify values other
than certain settings (0, 25, etc.). The myri10ge driver does allow  
you to specify any value.


Your results may be specific to that driver.
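If you want to experiment with the interrupt value directly, the knob is
ethtool's coalescing interface (interface name and values are examples, and
not every driver honors every parameter, which is really the point here):

  $ ethtool -c eth2                        # show current coalescing settings
  $ sudo ethtool -C eth2 rx-usecs 14 rx-frames 0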

Brice and Nathalie have a paper which implements an adaptive  
interrupt

coalescing so that you do not have to manually tune anything:


Isn't that only relevant if you control the firmware?  I previously
didn't really care about free firmware for devices in the same way as
free software generally, but am beginning to see reasons to care.


True, I believe they had to make two very small modifications to the
myri10ge firmware.


I have heard that some Ethernet drivers do or will support adaptive
coalescing which may give better performance than manually tuning and  
without modifying the NIC firmware for OMX.


Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Re: typical latencies for gigabit ethernet

2009-06-29 Thread Scott Atchley

On Jun 29, 2009, at 1:44 PM, Scott Atchley wrote:


Right, and that's what I did before, with sensible results I thought.
Repeating it now on Centos 5.2 and OpenSuSE 10.3, it doesn't behave
sensibly, and I don't know what's different from the previous SuSE
results apart, probably, from the minor kernel version.  If I set
rx-frames=0, I see this:

rx-usec    latency (µs)
  20          34.6
  12          26.3
   6          20.0
   1          14.8

whereas if I just set rx-frames=1, I get 14.7 µs, roughly  
independently

of rx-usec.  (Those figures are probably ±∼0.2µs.)


That is odd. I have only tested with Intel e1000 and our myri10ge  
Ethernet driver. The Intel driver does not let you specify value  
other than certain settings (0, 25, etc.). The myri10ge driver does  
allow you to specify any value.


Your results may be specific to that driver.


As Patrick kindly pointed out, you are using rx-frames and not rx-usec.
They are not equivalent.


Scott
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] 10 GbE

2009-02-11 Thread Scott Atchley

On Feb 11, 2009, at 7:57 AM, Igor Kozin wrote:


Hello everyone,
we are embarking on evaluation of 10 GbE for HPC and I was wondering  
if someone has already had experience with Arista 7148SX 48 port  
switch or/and Netxen cards. General pros and cons would be greatly  
appreciated and in particular

- Switch latency  (btw, the data sheet says x86 inside);


And it mentions 600 ns latency. I have not tested this switch  
myself. :-)



- Netxen NX3-20GxR card vs Intel 10 GbE AD DA card.
Many thanks in advance,
Igor


Hopefully, you broaden your NIC choice. ;-)

Scott

___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] tcp error: Need ideas!

2009-01-25 Thread Scott Atchley

On Jan 25, 2009, at 10:13 AM, Gerry Creager wrote:


-bash-3.2# ethtool -K rx off
no offload settings changed


You missed the interface here. You should try:

-bash-3.2# ethtool -K eth1 rx off


-bash-3.2# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

But here's the one I love:
-bash-3.2# ethtool -K tso off
no offload settings changed


Again, you are missing the interface:

-bash-3.2# ethtool -K eth1 tso off


I apparently can't control things with ethtool...


Ethtool provides -S to let the driver report additional information.  
Does this report anything:


# ethtool -S eth1

Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Odd SuperMicro power off issues

2008-12-08 Thread Scott Atchley

Hi Chris,

We had a customer with Opterons experience reboots with nothing in the  
logs, etc. The only thing we saw with ipmitool sel list was:


   1 | 11/13/2007 | 10:49:44 | System Firmware Error |

We traced to a HyperTransport deadlock, which by default reboots the  
node. Our engineer found this AMD note:


reset through sync-flooding is described in chapter 13.15 Error  
Handling in the following document:


http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/32559.pdf

When we changed the default PCI setting for this option (0x50) to off  
(i.e. no reboot, 0x40), the node did not reboot but it did hang and  
required a IPMI reboot.


Our working assumption is that one particular application running over our
NICs induced a traffic pattern that
caused a flow-control deadlock in HT.


Scott

On Dec 7, 2008, at 10:33 PM, Chris Samuel wrote:


Hi folks,

We've been tearing our hair out over this for a little
while and so I'm wondering if anyone else has seen anything
like this before, or has any thoughts about what could be
happening ?

Very occasionally we find one of our Barcelona nodes with
a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
it as powered down too.

No kernel panic, no crash, nothing in the system logs.

Nothing in the IPMI logs either, it's just sitting there
as if someone has yanked the power cable (and we're pretty
sure that's not the cause!).

There had not been any discernible pattern to the nodes
affected, and we've only a couple nodes where it's happened
twice, the rest only have had it happen once and scattered
over the 3 racks of the cluster.

For the longest time we had no way to reproduce it, but then
we noticed that for 3 of the power off's there was a particular
user running Fluent on there.   They've provided us with a copy
of their problem and we can (often) reproduce it now with that
problem.  Sometimes it'll take 30 minutes or so, sometimes it'll
take 4-5 hours, sometimes it'll take 3 days or so and sometimes
it won't do it at all.

It doesn't appear to be thermal issues as (a) there's nothing in
the IPMI logs about such problems and (b) we inject CPU and system
temperature into Ganglia and we don't see anything out of the
ordinary in those logs. :-(

We've tried other codes, including HPL, and Advanced Clustering's
Breakin PXE version, but haven't managed to (yet) get one of the
nodes to fail with anything except Fluent. :-(

The only oddity about Fluent is that it's the only code on
the system that uses HP-MPI, but we used the command line
switches to tell it to use the Intel MPI it ships with and
it did the same then too!

I just cannot understand what is special about Fluent,
or even how a user code could cause a node to just turn
off without a trace in the logs.

Obviously we're pursuing this through the local vendor
and (through them) SuperMicro, but to be honest we're
all pretty stumped by this.

Does anyone have any bright ideas ?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Security issues

2008-10-27 Thread Scott Atchley

On Oct 25, 2008, at 11:17 PM, Marian Marinov wrote:

Also a good security addition will be adding SELinux, RSBAC or  
GRSecurity to

the kernel and actually using any of these.


Bear in mind that there may be performance trade-offs. Enabling
SELinux will cut 2 Gb/s off a 10 Gb/s link as measured by netperf's  
TCP_STREAM.


I am not saying don't use SELinux, but simply be aware of what impact  
it will have.
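A quick before/after check on a test node (not a suggestion to leave
enforcement off in production):

  $ getenforce
  Enforcing
  $ netperf -H otherhost -t TCP_STREAM
  $ sudo setenforce 0          # Permissive until the next boot
  $ netperf -H otherhost -t TCP_STREAM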


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Has DDR IB gone the way of the Dodo?

2008-10-03 Thread Scott Atchley

On Oct 3, 2008, at 2:24 PM, Bill Broadley wrote:

QDR over fiber should be reasonably priced, here's hoping that the  
days of

Myrinet 250MB/sec optical cables will return.

Corrections/comments welcome.


I am not in sales and I have no access to pricing besides our list  
prices, but I am told that optical QSFP is getting very close to CX-4  
when you price NICs and cables (QSFP NICs cost less than CX-4 NICs,  
the cables are more, the total package is very close).


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] scratch File system for small cluster

2008-09-25 Thread Scott Atchley

On Sep 25, 2008, at 10:19 AM, Joe Landman wrote:

We have measured NFSoverRDMA speeds (on SDR IB at that) at 460 MB/s,  
on an RDMA adapter reporting 750 MB/s (in a 4x PCIe slot, so ~860 MB/ 
s max is what we should expect for this).  Faster IB hardware should  
result in better performance, though you still have to walk through  
the various software stacks, and they ... remove efficiency ...  
(nice PC way to say that they slow things down a bit :( )


Joe,

Even though recent kernels allow rsize and wsize of 1 MB for TCP,  
RPCRDMA only supports 32 KB. This will limit your throughput some  
regardless of faster hardware.
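For reference, a typical client mount for NFS over RDMA looks something like
this (server, export, and mount point are examples; exact option syntax
varies with kernel version):

  $ sudo mount -o rdma,port=20049,rsize=32768,wsize=32768 server:/export /mnt/scratch
  $ grep /mnt/scratch /proc/mounts     # shows the rsize/wsize actually in effect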


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Gigabit Ethernet and RDMA

2008-08-11 Thread Scott Atchley

Hi Gus,

Are you trying to find software for NICs you currently have? Or are  
you looking for gigabit Ethernet NICs that natively support some form  
of kernel-bypass/zero-copy?


I do not know of any of the latter (do Chelsio or others offer 1G NICs  
with iWarp?).


As for the former, there are several options for cluster use:

I believe Scyld has an optimized Ethernet stack. GAMMA has special  
drivers for certain Intel NICs. PM/Ethernet-HXB is under active  
development. Open-MX works over any Ethernet driver and works with any  
MPI that works with native MX.


If you are interested in 10G Ethernet, MX on Myricom 10G NICs can work  
with our Myrinet switches or any brand of Ethernet switches (some  
Ethernet switches provide lower latency than others).


Scott

On Aug 11, 2008, at 4:28 PM, Gus Correa wrote:


Hello Beowulf fans

Does anyone know the status of  RDMA on Gigabit Ethernet?

Is it a stable solution for a cluster interconnect, or still an  
experimental thing?


Is it effective in offloading network tasks from the CPU?
(Myrinet and Infiniband seem to use RDMA effectively, right?)

What does it take for it to work under typical Linux distributions?  
A driver?

A special kernel?
Something else? Just plug the NIC in and play?

Does it support standard MPICH2 and/or OpenMPI compiled out of the  
box,
or does it require linking to some type of special low level  
communication library,
or perhaps requires the use of a special flavor or MPI (say, from  
the NIC vendor)?


I poked around on the web,
and learned that Ammasso seems to have pioneered RDMA-enabled GigE  
NICs (Ammasso 1100).

Broadcom advertises a NIC with similar characteristics (BCM5706).
However, it is unclear if RDMA GigE NICs would work with standard  
Linux distros,
if it is effective, how much it costs, and how much hassle is  
required to make it work.


Thank you,
Gus Correa

--
-
Gustavo J. Ponce Correa, PhD - Email: [EMAIL PROTECTED]
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
-

___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Roadrunner picture

2008-07-16 Thread Scott Atchley

On Jul 16, 2008, at 6:50 PM, John Hearns wrote:


On Wed, 2008-07-16 at 23:29 +0100, John Hearns wrote:

To answer your question more directly, Panasas is a storage cluster  
to
complement your compute cluster. Each storage blade is connected  
into a

shelf (chassis) with an internal ethernet network. Each shelf is then
connected to your ethernet switch with at least 4Gbps of bandwidth.


Before I damn Panasas with faint praise, there is an option for higher
bandwidth connectivity which I'd hazard a guess is in use here.
And remember that's the connectivity to each one of the chassis -  
not to

the system as a whole.



They do. Here is more info:

http://www.byteandswitch.com/document.asp?doc_id=155938WT.svl=news1_2

Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] automount on high ports

2008-07-02 Thread Scott Atchley

On Jul 2, 2008, at 7:22 AM, Carsten Aulbert wrote:


Bogdan Costescu wrote:


Have you considered using a parallel file system ?


We looked a bit into a few, but would love to get any input from  
anyone
on that. What we found so far was not really convincing, e.g.  
glusterFS
at that time was not really stable, lustre was too easy to crash -  
at l

east at that time, ...


Hi Carsten,

I have not looked at GlusterFS at all. I have worked with Lustre and  
PVFS2 (I wrote the shims to allow them to run on MX).


Although I believe Lustre's robustness is very good these days, I do  
not believe that it will work in your setting. I think that they
currently do not recommend mounting a client on a node that is also  
working as a server as you are doing with NFS. I believe it is due to  
memory contention leading to deadlock.


PVFS2 does, however, support your scenario where each node is a server  
and can be mounted locally as well. PVFS2 servers run in userspace and  
can be easily debugged. If you are using MPI-IO, it integrates nicely  
as well. Even so, keep in mind that using each node as a server will  
consume network resources and will compete with MPI communications.


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] automount on high ports

2008-07-02 Thread Scott Atchley

On Jul 2, 2008, at 10:09 AM, Gerry Creager wrote:

Although I believe Lustre's robustness is very good these days, I  
do not believe that it will work in your setting. I think that
they currently do not recommend mounting a client on a node that is  
also working as a server as you are doing with NFS. I believe it is  
due to memory contention leading to deadlock.


Lustre is good enough that it's the parallel FS at TACC for the  
Ranger cluster.  And, I've had no real problems as a user thereof.   
We're bringing up glustre on our new cluster here (<flamebait>
CentOS/RHEL5, not debian </flamebait>).  We looked at zfs but didn't
have sufficient experience to go that path.


I believe that all the large DOE labs are using Lustre and would not  
if it were not reliable. My only concern was Carsten not having  
dedicated server nodes and mounting directly on those nodes.


I may be off-base and hopefully one of the Lustre/SUN people might  
correct me if so. :-)


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] How Can Microsoft's HPC Server Succeed?

2008-04-03 Thread Scott Atchley

On Apr 3, 2008, at 3:52 PM, Kyle Spaans wrote:
On Wed, Apr 2, 2008 at 7:39 PM, Chris Dagdigian [EMAIL PROTECTED] wrote:
spew out a terabyte per day of raw data and many times that stuff needs
to be post processed and distilled down into different forms. A nice
little 8-core box running a shrink-wrap HPC product with a single
support contact could find a nice little niche in non-datacenter areas
where significant compute is needed nearby some other sort of dedicated
instrument or device.

...
On Wed, Apr 2, 2008 at 6:44 PM, Greg Byshenk [EMAIL PROTECTED] wrote:

 a business will also have to find
 (and pay) someone to build and maintain the cluster.



Forgive me perhaps for being naive, but why can't a knowledgeable
teenager / college student be paid ~$10/hour plus on-call time to do a
setup like this? Presuming they only need to hire someone to do
setup/administration/support (and not the actual programming itself).


What if your data is sensitive or has HIPAA requirements? Do you want
a part-timer having admin control of that data (regardless of whether
it is Linux, Windows, or MacOSX)?


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Cheap SDR IB

2008-01-30 Thread Scott Atchley

On Jan 30, 2008, at 6:20 PM, Gilad Shainer wrote:

For BW, Lx provides ~1400MB/s, EX is ~1500MB/s and ConnectX is
~1900MB/s uni-directional on PCIe Gen2.

Feel free to contact me directly for more info.

Gilad.


My god, IB bandwidths always confuse me. :-)

I thought IB SDR was 10 Gb/s signal rate and 8 Gb/s data rate. How do
you squeeze ~1400 MB/s out of 8 Gb/s?


I see you offer Lx cards in PCIe 4x and 8x. Again, PCIe 4x is 8b/10b
encoded so its data rate is 8 Gb/s. So the above value is for your
8x cards only, no? The thread is about your 4x cards, no?
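
To keep my own head straight, here is the back-of-the-envelope
arithmetic I am working from (my numbers, assuming 8b/10b encoding on
both the IB link and PCIe Gen1/Gen2):

/* rough link-rate arithmetic, nothing vendor-specific */
#include <stdio.h>

static double data_gbps(double lanes, double gbaud_per_lane)
{
    return lanes * gbaud_per_lane * 8.0 / 10.0;   /* 8b/10b overhead */
}

int main(void)
{
    printf("IB 4x SDR   : %.1f Gb/s data (~%.0f MB/s)\n",
           data_gbps(4, 2.5), data_gbps(4, 2.5) * 1000 / 8);
    printf("IB 4x DDR   : %.1f Gb/s data (~%.0f MB/s)\n",
           data_gbps(4, 5.0), data_gbps(4, 5.0) * 1000 / 8);
    printf("PCIe x4 Gen1: %.1f Gb/s data\n", data_gbps(4, 2.5));
    printf("PCIe x8 Gen1: %.1f Gb/s data\n", data_gbps(8, 2.5));
    printf("PCIe x8 Gen2: %.1f Gb/s data\n", data_gbps(8, 5.0));
    return 0;
}

By that arithmetic ~1400 MB/s is out of reach for 4x SDR or a PCIe x4
Gen1 slot, so I assume the quoted figures are for DDR-rate links in at
least an x8 slot - please correct me if I am misreading them.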


Are there many PCIe Gen2 motherboards yet?

Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Really efficient MPIs??

2007-11-28 Thread Scott Atchley

On Nov 28, 2007, at 8:49 AM, Charlie Peck wrote:


On Nov 28, 2007, at 8:04 AM, Jeffrey B. Layton wrote:

Unless you are using a gigabit ethernet, Open-MPI is noticeably
less efficient than LAM-MPI over that fabric.


I suspect at some point in the future gige will catch-up but for  
now my (limited) understanding is that the Open-MPI folks are  
focusing their time on higher bandwidth/lower latency fabrics than  
gige.


charlie



At SC07 MPICH2 BoF, I gave a brief talk about MPICH2-MX. In addition  
to showing results of it running over MX-10G, I had a few slides  
showing performance using MPICH2-MX over Open-MX on Intel e1000  
drivers (80003ES2LAN NICs). IMB Pingpong latency is ~10 us for small  
messages and throughput for large messages is near line rate.
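
If anyone wants a rough feel for that number on their own fabric
without grabbing IMB, a bare-bones ping-pong is only a few lines (a
sketch I am adding for illustration, not the IMB code):

/* minimal ping-pong: ranks 0 and 1 bounce a small message and report
 * the half round-trip time as the latency estimate */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 8;          /* small message */
    char buf[8];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("half round-trip latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}

Run it with two ranks placed on two different nodes so the message
actually crosses the fabric you care about.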


Open-MX provides the same API as MX, runs on _any_ Ethernet driver/NIC,
should work with most MX software (MPICH-MX, MPICH2-MX, PVFS2, etc.)
and is open source. It has not reached a release tarball yet, but we
expect to see one soon.


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Not quite Walmart, or, living without ECC?

2007-11-26 Thread Scott Atchley

On Nov 26, 2007, at 3:27 PM, David Mathog wrote:


I ran a little test over the Thanksgiving holiday to see how common
random errors in non-ECC memory are.  I used the memtest86+ bit fade
test mode, which writes all 1s, waits 90 minutes, checks the result,
then does the same thing for all 0s.  Anyway, this was the best test I
could find for detecting the occasional gamma ray type data loss event.
The result: no errors logged in 5 solid days of testing.  So this class
of error (the type ECC would detect and probably fix) apparently occurs
on these machines at a rate of less than 1 per 840 Gigabyte-hours.
Possibly the upper limit is half that if data can only be lost on a
1 -> 0 transition, or vice versa.  This assumes the bit fade test
works, which cannot be independently verified from these results.

On the web there are references to an IBM study which found 1 bit
error/256MB/month, which would have been (.25 * 30 * 24) =
1 per 180 Gigabyte-hours.  If IBM's numbers held for my hardware
I should have seen 4 or 5 errors in total.  Mine are in a basement
in a concrete building, perhaps that provided some shielding relative
to what IBM used for their test conditions.

The memory was Corsair Twinx1024-3200C2.  When first installed all
of this memory had run for 24 hours with no errors in normal
memtest86+ testing.

Regards,

David Mathog


Or maybe you got lucky. Five days may not be long enough.

We have had customers report events that included parity errors on  
hundreds of nodes simultaneously on large clusters. Higher altitude  
makes things worse. Being in a DOE lab near lots of interesting  
materials does not help either. :-)
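
For what it's worth, a quick back-of-the-envelope Poisson check (my own
arithmetic on the numbers quoted above, so treat it as a rough sketch):

/* if the IBM rate of ~1 error per 180 GB-hours held, a zero-error run
 * over the ~840 GB-hours implied above is a Poisson outcome with mean
 * ~4.7, i.e. roughly a 1% chance */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double ibm_rate  = 1.0 / 180.0;  /* errors per GB-hour (IBM study) */
    double test_gb_h = 840.0;        /* exposure implied by 1-per-840  */
    double expected  = ibm_rate * test_gb_h;
    double p_zero    = exp(-expected);

    printf("expected errors at IBM rate: %.1f\n", expected);
    printf("P(observing zero errors)   : %.3f\n", p_zero);
    return 0;
}

So either the observed rate really is lower than IBM's, the bit fade
test misses some events, or it was roughly a 1-in-100 run.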


Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Problem with Single RAID disk larger than 2TB and Linux

2007-10-03 Thread Scott Atchley

Is someone using a signed int to represent the 1 KB blocks?

2 * 1024 * 1024 * 1024 * 1024 = 2,199,023,255,552
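
If that is the culprit, the arithmetic is easy to demonstrate (a toy
sketch I am adding, not the actual driver code):

/* toy sketch of the suspected overflow: counting 1 KiB blocks of a
 * 3 TB LUN exceeds a signed 32-bit int (typically wrapping negative on
 * two's-complement hardware), and 2^32 512-byte sectors is exactly the
 * 2199023 MB ceiling shown in the kernel log below */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t lun_bytes = 3ULL * 1000 * 1000 * 1000 * 1000; /* 3 TB LUN */
    uint64_t blocks_1k = lun_bytes / 1024;                 /* ~2.9e9   */

    printf("1 KiB blocks needed     : %llu\n",
           (unsigned long long)blocks_1k);
    printf("INT32_MAX               : %ld\n", (long)INT32_MAX);
    printf("as signed 32-bit        : %ld (wrapped)\n",
           (long)(int32_t)blocks_1k);
    printf("2^32 x 512-byte sectors : %llu bytes (the 2 TiB wall)\n",
           (unsigned long long)(4294967296ULL * 512));
    return 0;
}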

Scott

On Oct 3, 2007, at 7:29 AM, Anand Vaidya wrote:


Dear Beowulfers,

We ran into a problem with large disks which I suspect is fairly  
common, however the usual solutions are not working.  IBM, RedHat  
have not been able to provide any useful answers so I am turning to  
this list for help. (Emulex is still helping, but I am not sure how  
far they can go without access to the hardware)


Details:

* Linux Cluster for Weather modelling

*  IBM Bladecenter blades and an IBM x3655 Opteron head node FC  
attached to a Hitachi Tagmastore SAN storage, Emulex LightPulse FC  
HBA, PCI-Express, Dual port


* RHEL 4update5, x86_64 kernel 2.6.9-55 SMP and RHEL provided  
Emulex driver (lpfc) and lpfcdfc also installed


* GPT partition created with parted

There is one 2TB LUN, works fine.

There is a 3TB LUN on the Hitachi SAN which is reported as only
2199GB (2.1TB).


We noticed that, when the emulex driver loads, the following error  
message is reported:


Emulex LightPulse Fibre Channel SCSI driver 8.0.16.32
Copyright(c) 2003-2007 Emulex.  All rights reserved.
ACPI: PCI Interrupt :2d:00.0[A] - GSI 18 (level,  
low) - IRQ 185

PCI: Setting latency timer of device :2d:00.0 to 64
lpfc :2d:00.0: 0:1305 Link Down Event x2 received  
Data: x2 x4 x1000
lpfc :2d:00.0: 0:1305 Link Down Event x2 received  
Data: x2 x4 x1000
lpfc :2d:00.0: 0:1303 Link Up Event x3 received  
Data: x3 x1 x10 x0
scsi5 : IBM 42C2071 4Gb 2-Port PCIe FC HBA for System x  
on PCI bus 2d device 00 irq 185 port 0

Vendor: HITACHI   Model: OPEN-V*3  Rev: 5007
Type:   Direct-Access  ANSI SCSI  
revision: 03

sdb : very big device. try to use READ CAPACITY(16).
sdb : READ CAPACITY(16) failed.
sdb : status=1, message=00, host=0, driver=08
sdb : use 0x as device size
SCSI device sdb: 4294967296 512-byte hdwr sectors  
(2199023 MB)

SCSI device sdb: drive cache: write back
sdb : very big device. try to use READ CAPACITY(16).
sdb : READ CAPACITY(16) failed.
sdb : status=1, message=00, host=0, driver=08
sdb : use 0x as device size
SCSI device sdb: 4294967296 512-byte hdwr sectors  
(2199023 MB)

SCSI device sdb: drive cache: write back

The problem is the READ CAPACITY(16) failure, but we are unable
to find the source of this error.


We conducted several experiments without success:

- Tried compiling the latest driver from Emulex (8.0.16.32) - same  
error
- Tried Knoppix (2.6.19) and Gentoo LiveCD (2.6.19 ) , and CentOS  
4.4   - same error
- Tried to boot Belenix (Solaris 32 bit live), failed to boot  
completely (may be unrelated issue)


We have a temporary workaround in place: We created 3x1TB disks and  
used LVM to create a striped 3TB  volume with ext3 FS. This works  
fine.


RedHat claims ext3 and RHEL4 support disks of up to 8TB and 16TB
respectively (since RHEL4u2)


I would like to know if anyone on the list has any pointers that  
can help us solve the issue.


Regards
Anand Vaidya





___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Passwordless ssh - strange problem

2007-09-14 Thread Scott Atchley
On Sep 14, 2007, at 1:14 PM, [EMAIL PROTECTED]  
[EMAIL PROTECTED] wrote:



Checked that - it's 700.


On original host (ssh from) and target (ssh to), run:

$ ls -al ~/.ssh

also, try:

$ ssh -vvv target

Please post back with results and contents of /etc/ssh/*config.

Scott 
___

Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] MPI2007 out - strange pop2 results?

2007-07-21 Thread Scott Atchley

Hi Gilad,

Presentation at ISC? I did not attend this year and, while I did last  
year, I did not give any presentations. I simply talked to customers  
in our booth and walked the floor. I even stopped by the Mellanox  
booth and chatted awhile. :-)


Scott

On Jul 20, 2007, at 9:31 PM, Gilad Shainer wrote:


Hi Scot,

I always try to mention exactly what I am comparing to, and not making
it what it is not. And in most cases, I use the exact same platform and
mention the details. This makes the information much more credible,
don't you agree?

By the way, in the presentation you had at ISC, you did exactly the same
as what my dear friends from Qlogic did... sorry, I could not resist...


G

-Original Message-
From: Scott Atchley [mailto:[EMAIL PROTECTED]
Sent: Friday, July 20, 2007 6:21 PM
To: Gilad Shainer
Cc: Kevin Ball; beowulf@beowulf.org
Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?

Gilad,

And you would never compare your products against our deprecated
drivers and five year old hardware. ;-)

Sorry, couldn't resist. My colleagues are rolling their eyes...

Scot

On Jul 20, 2007, at 2:55 PM, Gilad Shainer wrote:


Hi Kevin,

I believe that your company has been using this list for pure marketing
wars for a long time, so don't be surprised when someone responds back.

If you want to put technical or performance data out there, and then
draw conclusions from it, be sure to compare apples to apples. It is
easy to use the lower-performance device results of your competitor and
then attack his architecture or his entire product line. If this is not
a marketing war, then I would be interested to know what you call a
marketing war.

G


-Original Message-
From: Kevin Ball [mailto:[EMAIL PROTECTED]
Sent: Friday, July 20, 2007 11:27 AM
To: Gilad Shainer
Cc: Brian Dobbins; beowulf@beowulf.org
Subject: RE: [Beowulf] MPI2007 out - strange pop2 results?

Hi Gilad,

  Thank you for the personal attack that came, apparently without even
reading the email I sent.  Brian asked about why the publicly
available, independently run MPI2007 results from HP were worse on a
particular benchmark than the Cambridge cluster MPI2007 results.  I
talked about three contributing factors to that.  If you have other
reasons you want to put forward, please do so based on data, rather
than engaging in a blatant ad hominem attack.

  If you want to engage in a marketing war, there are venues with
which to do it, but I think on the Beowulf mailing list data and
coherent thought are probably more appropriate.

-Kevin

On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:

Dear Kevin,

You continue to set world records in providing misleading information.
You had previously compared Mellanox-based products on dual
single-core machines to the InfiniPath adapter on dual dual-core
machines and claimed that with InfiniPath there are more Gflops. This
latest release follows the same lines...

Unlike QLogic InfiniPath adapters, Mellanox provides different
InfiniBand HCA silicon and adapters. There are 4 different silicon
chips, each with different size, different power, different price and
different performance. There is the PCI-X device (InfiniHost), the
single-port device that was designed for best price/performance
(InfiniHost III Lx), the dual-port device that was designed for best
performance (InfiniHost III Ex) and the new ConnectX device that was
designed to extend the performance capabilities of the dual-port
device. Each device provides different price and performance points
(did I say different?).

The SPEC results that you are using for Mellanox are of the single-port
device. And even that device (whose list price is probably half of your
InfiniPath's) had better results with 8 server nodes than yours.

Your comparison of InfiniPath to the Mellanox single-port device
should have been on price/performance and not on performance. Now, if
you want to really compare performance to performance, why don't you
use the dual-port device, or even better, ConnectX? Well... I will do
it for you.

Every time I have compared my performance adapters to yours, your
adapters did not even come close...


Gilad.

-Original Message-
From: [EMAIL PROTECTED] [mailto:beowulf-
[EMAIL PROTECTED] On Behalf Of Kevin Ball
Sent: Thursday, July 19, 2007 11:52 AM
To: Brian Dobbins
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?

Hi Brian,

   The benchmark 121.pop2 is based on a code that was already
important to QLogic customers before the SPEC MPI2007 suite was
released (POP, Parallel Ocean Program), and we have done a fair amount
of analysis trying to understand its performance characteristics.
There are three things that stand out in performance analysis on pop2.

  The first point is that it is a very demanding code on the compiler.
There has been a fair amount of work on pop2 by the PathScale compiler
team, and the fact

Re: [Beowulf] MPI2007 out - strange pop2 results?

2007-07-20 Thread Scott Atchley

Gilad,

And you would never compare your products against our deprecated  
drivers and five year old hardware. ;-)


Sorry, couldn't resist. My colleagues are rolling their eyes...

Scot

On Jul 20, 2007, at 2:55 PM, Gilad Shainer wrote:


Hi Kevin,

I believe that your company has been using this list for pure marketing
wars for a long time, so don't be surprised when someone responds back.

If you want to put technical or performance data out there, and then
draw conclusions from it, be sure to compare apples to apples. It is
easy to use the lower-performance device results of your competitor and
then attack his architecture or his entire product line. If this is not
a marketing war, then I would be interested to know what you call a
marketing war.

G


-Original Message-
From: Kevin Ball [mailto:[EMAIL PROTECTED]
Sent: Friday, July 20, 2007 11:27 AM
To: Gilad Shainer
Cc: Brian Dobbins; beowulf@beowulf.org
Subject: RE: [Beowulf] MPI2007 out - strange pop2 results?

Hi Gilad,

  Thank you for the personal attack that came, apparently without even
reading the email I sent.  Brian asked about why the publicly available,
independently run MPI2007 results from HP were worse on a particular
benchmark than the Cambridge cluster MPI2007 results.  I talked about
three contributing factors to that.  If you have other reasons you want
to put forward, please do so based on data, rather than engaging in a
blatant ad hominem attack.

  If you want to engage in a marketing war, there are venues with which
to do it, but I think on the Beowulf mailing list data and coherent
thought are probably more appropriate.

-Kevin

On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:

Dear Kevin,

You continue to set world records in providing misleading information.
You had previously compared Mellanox-based products on dual
single-core machines to the InfiniPath adapter on dual dual-core
machines and claimed that with InfiniPath there are more Gflops. This
latest release follows the same lines...

Unlike QLogic InfiniPath adapters, Mellanox provides different
InfiniBand HCA silicon and adapters. There are 4 different silicon
chips, each with different size, different power, different price and
different performance. There is the PCI-X device (InfiniHost), the
single-port device that was designed for best price/performance
(InfiniHost III Lx), the dual-port device that was designed for best
performance (InfiniHost III Ex) and the new ConnectX device that was
designed to extend the performance capabilities of the dual-port
device. Each device provides different price and performance points
(did I say different?).

The SPEC results that you are using for Mellanox are of the single-port
device. And even that device (whose list price is probably half of your
InfiniPath's) had better results with 8 server nodes than yours.

Your comparison of InfiniPath to the Mellanox single-port device
should have been on price/performance and not on performance. Now, if
you want to really compare performance to performance, why don't you
use the dual-port device, or even better, ConnectX? Well... I will do
it for you.

Every time I have compared my performance adapters to yours, your
adapters did not even come close...


Gilad.

-Original Message-
From: [EMAIL PROTECTED] [mailto:beowulf- 
[EMAIL PROTECTED]

On Behalf Of Kevin Ball
Sent: Thursday, July 19, 2007 11:52 AM
To: Brian Dobbins
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?

Hi Brian,

   The benchmark 121.pop2 is based on a code that was already
important to QLogic customers before the SPEC MPI2007 suite was
released (POP, Parallel Ocean Program), and we have done a fair amount
of analysis trying to understand its performance characteristics.
There are three things that stand out in performance analysis on pop2.

  The first point is that it is a very demanding code on the compiler.
There has been a fair amount of work on pop2 by the PathScale compiler
team, and the fact that the Cambridge submission used the PathScale
compiler while the HP submission used the Intel compiler accounts for
some (the serial portion) of the advantage at small core counts,
though scalability should not be affected by this.

  The second point is that pop2 is fairly demanding of IO.  Another
example to look at for this is in comparing the AMD Emerald Cluster
results to the Cambridge results;  the Emerald cluster is using NFS
over GigE from a single server/disk, while Cambridge has a much more
optimized IO subsystem.  While on some results Emerald scales better,
for pop2 it scales only from 3.71 to 15.0 (4.04X) while Cambridge
scales from 4.29 to 21.0 (4.90X).  The HP system appears to be using
NFS over DDR IB from a single server with a RAID;  thus it should fall
somewhere between Emerald and Cambridge in this regard.

  The first two points account for some of the difference, but by no
means all.  The final one is probably the 

Re: [Beowulf] Performance characterising a HPC application

2007-03-21 Thread Scott Atchley

On Mar 21, 2007, at 12:05 AM, Mark Hahn wrote:

if the net is a bandwidth bottleneck, then you'd see lots of
back-to-back packets, adding up to near wire-speed.  if latency is the
issue, you'll see relatively long delays between request and response
(in NFS, for instance).  my real point is simply that tcpdump allows
you to see the unadorned truth about what's going on.  obviously,
tcpdump will let you see the rate and scale of your flows, and between
which nodes...


anything which doesn't speed up going from gigabit to IB/10G/quadrics
is what I would call embarrassingly parallel...


True - I guess I'm trying to do some cost/benefit analysis so the
magnitude of the improvement is important to me .. but maybe measuring
it on a test cluster is the only way to be sure of this one.


well, maybe.  it's a big jump from 1x Gb to IB or 10GE - I wish it
were easier to advocate Myri 2G as an intermediate step, since I
actually don't see a lot of apps showing signs of dissatisfaction
with ~250 MB/s interconnect - and IB/10GE don't have much advantage,
if any, in latency.


Mark,

I have not benchmarked any applications that need more than 250 MB/s
during computation, although I know someone at ORNL that could get
close to 125 MB/s on the X1e (which doesn't use a Myricom fabric).
Where 10G comes into play is data movement. You can get ~700-900 MB/s
with IB SDR, ~1,200 MB/s with Myri-10G (Ethernet or MX), and ~1,400
MB/s using IB DDR with Lustre, for example.


There is little difference between latency for Myrinet-2000 E and F  
cards and Myri-10G for small messages. Once messages start to go over  
1 KB, then the extra bandwidth helps. As always, profile your code.


product plug
If you are using MX, we have added some optional statistics available  
with our debug library that will give you the counts for each size  
class of message at the completion of the run. In addition to a  
profiler like pMPI, it can help you determine if your app is more  
latency or bandwidth sensitive.

/product plug
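
For anyone not on MX, a rough do-it-yourself version of the same idea
is possible through the standard PMPI profiling interface. The sketch
below is my illustration, not our MX code, and it assumes an MPI-3
style const-qualified MPI_Send prototype; link it ahead of your MPI
library and it buckets the bytes of every send:

/* PMPI sketch: intercept MPI_Send, histogram message sizes, and dump
 * the per-rank counts when MPI_Finalize is called */
#include <mpi.h>
#include <stdio.h>

static long buckets[4];   /* <1 KB, 1-32 KB, 32 KB-1 MB, >1 MB */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(type, &size);
    long bytes = (long)size * count;

    if      (bytes < 1024)        buckets[0]++;
    else if (bytes < 32 * 1024)   buckets[1]++;
    else if (bytes < 1024 * 1024) buckets[2]++;
    else                          buckets[3]++;

    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d sends: <1K=%ld 1-32K=%ld 32K-1M=%ld >1M=%ld\n",
           rank, buckets[0], buckets[1], buckets[2], buckets[3]);
    return PMPI_Finalize();
}

Counts skewed toward the small buckets suggest the app is latency
sensitive; lots of multi-megabyte sends suggest it wants bandwidth.
Only MPI_Send is counted here; collectives and other send flavors
would need their own wrappers.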

Scott
___
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

