Re: [hwloc-users] Netloc feature suggestion

2019-08-16 Thread Jeff Squyres (jsquyres) via hwloc-users
Don't forget that network topologies can also be complex -- it's not always a 
simple, single-path hierarchy.  There can be multiple paths between any pair of 
hosts on the network.  Sometimes the hosts are aware of the multiple paths, 
sometimes they are not (e.g., sometimes the fabric routing changes during the 
course of a single MPI job, and the hosts/MPI applications are unaware).

Meaning: the information about which network paths are taken for a given 
host-A-to-host-B traversal may be both distributed and transient.


On Aug 14, 2019, at 11:05 AM, Rigel Falcao do Couto Alves 
<rigel.al...@tu-dresden.de> wrote:

Hi,

I am doing a PhD in performance analysis of highly parallel CFD codes and would 
like to suggest a feature for Netloc: from topic Build Scotch sub-architectures 
(at https://www.open-mpi.org/projects/hwloc/doc/v2.0.3/a00329.php), create a 
function-version of netloc_get_resources, which could retrieve at runtime the 
network details of the available cluster resources (i.e. the nodes allocated to 
the job). I am mostly interested in how many switches (the gray circles in 
the figure below) need to be traversed in order for any pair of allocated nodes 
to communicate with each other:



For example, suppose my job is running within 4 nodes in the cluster, 
illustrated by the numbers above. All I would love to get from Netloc - at 
runtime - is some sort of classification of the nodes, like:

1: aa
2: ab
3: ba
4: ca

The difference between nodes 1 and 2 is on the last digit, which means their 
MPI communications only need to traverse 1 switch; however, between any of them 
and nodes 3 or 4, the difference starts on the second-last digit, which means 
their communications need to traverse two switches. More digits may be 
left-added to the string, per necessity; i.e. if the central gray circle on the 
above figure is connected to another switch, which in turn leads to another part 
of the cluster's structure (with its own switches, nodes etc.). For me, it is 
at the present moment irrelevant whether e.g. nodes 1 and 2 are physically - or 
logically - consecutive to each other: a, b, c etc. would be just arbitrary 
identifiers.
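
For illustration only (this is not part of Netloc's API), here is a minimal Python sketch of how such labels could be compared once they are available. It assumes all labels have the same length and follows the convention described above: the count is the number of digits from the first differing position to the end of the label. The name `switches_between` is hypothetical.

```python
def switches_between(label_a: str, label_b: str) -> int:
    """Number of switches traversed, following the labelling convention above."""
    if len(label_a) != len(label_b):
        raise ValueError("labels must have the same length")
    for position, (a, b) in enumerate(zip(label_a, label_b)):
        if a != b:
            # Every digit from the first difference to the end counts as one switch.
            return len(label_a) - position
    return 0  # identical labels: same node


# The four allocated nodes from the example above.
nodes = {1: "aa", 2: "ab", 3: "ba", 4: "ca"}
for i in nodes:
    for j in nodes:
        if i < j:
            count = switches_between(nodes[i], nodes[j])
            print(f"nodes {i} and {j}: {count} switch(es)")
```

With the example labels, nodes 1 and 2 come out at one switch and every other pair at two, which matches the description.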

I would then use this data to plot the process placement, using open-source 
tools developed here in the University of Dresden (Germany); i.e. Scotch is not 
an option for me. The results of my study will be open-source as well and I can 
gladly share them with you once the thesis is finished.

I hope I have clearly explained what I have in mind; please let me know if 
there are any questions. Finally, it is important that this feature is part of 
Netloc's API (as it is supposed to be integrated with the tools we develop 
here), works at runtime and doesn't require root privileges (as those tools are 
used by our cluster's customers in their everyday job submissions).

Kind regards,


--
Dipl.-Ing. Rigel Alves
researcher

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
Zellescher Weg 12 A 218, 01069 Dresden | Germany

+49 (351) 463.42418
https://tu-dresden.de/zih/die-einrichtung/struktur/rigel-alves


___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


--
Jeff Squyres
jsquy...@cisco.com




Re: [hwloc-users] Hwloc command not working

2017-03-02 Thread Jeff Squyres (jsquyres)
Jeyaraj --

I think what we need is a bit more specific information in order to help you.  
Everyone's system is set up differently; we don't know how yours is set up.  For 
example:

- What version of hwloc did you install?
- Where did you get the RPM for hwloc?
- How exactly are you testing?
- You mentioned that you installed an hwloc RPM -- did you look at the RPM 
contents to see if it includes the hwloc-dump-hwdata command?  (e.g., "rpm -ql 
hwloc")
- If you're using hwloc from that RPM and something is not there that you 
expect to be there, you might want to contact the maintainer of that RPM (e.g., 
look at "rpm -qi hwloc")
- ...etc.

When you check the contents of the hwloc RPM, you might notice that 
hwloc-dump-hwdata might well be installed in the sbin directory, not the bin 
directory.  A wild guess: perhaps your PATH does not include the sbin 
directory...?
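
If it helps, the PATH guess can be checked with a few lines of Python. This is only a sketch: the list of sbin-style directories is an assumption (adjust it for your system), and `locate_tool` is a hypothetical helper name.

```python
import os
import shutil

# Directories that often hold administrator tools; an assumption, adjust as needed.
SBIN_DIRS = ("/usr/sbin", "/sbin", "/usr/local/sbin")


def locate_tool(name, extra_dirs=SBIN_DIRS):
    """Return (path, on_path): where the tool was found, and whether PATH alone finds it."""
    found = shutil.which(name)
    if found is not None:
        return found, True
    # Not on PATH: retry, searching only the sbin-style directories.
    found = shutil.which(name, path=os.pathsep.join(extra_dirs))
    return found, False


path, on_path = locate_tool("hwloc-dump-hwdata")
if path is None:
    print("hwloc-dump-hwdata not found; check the RPM contents with 'rpm -ql hwloc'")
elif on_path:
    print(f"found on PATH: {path}")
else:
    print(f"installed at {path}, but that directory is not in PATH")
```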


> On Mar 2, 2017, at 8:25 AM, Marco Atzeri  wrote:
> 
> On 02/03/2017 11:56, jeyaraj wrote:
>> Hi,
>> All rpm file and tar file also installed. Not working
>> 
> 
> a bit vague, and we have no crystal ball to look on your system.
> 
> Are you sure the package is properly installed ?
> How you did the verification ?
> Should you not ask on the mailing list for help of your system,
> what ever it is, instead of asking here ?
> 
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users


-- 
Jeff Squyres
jsquy...@cisco.com



[hwloc-users] Open MPI mail archives now back online

2016-08-12 Thread Jeff Squyres (jsquyres)
mail-archive.com now has all of the old Open MPI mail archives online.  Examples:

   https://www.mail-archive.com/users@lists.open-mpi.org/
   https://www.mail-archive.com/devel@lists.open-mpi.org/

Note that there are two different ways you can permalink to messages on 
mail-archive:

1. Take the "main" URL of the message (i.e., the one shown in the address bar 
when you're viewing a message) -- e.g.

   https://www.mail-archive.com/users@lists.open-mpi.org/msg28978.html

2. Use the message ID (which uniquely identifies a message) in the form:

   http://mid.mail-archive.com/MESSAGE_ID
   NOTE: http, not https!

e.g.

   
http://mid.mail-archive.com/1233038409.12589.1460577425350.JavaMail.yahoo@mail.yahoo.com
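
As a small sketch of option 2, the permalink can be built from a raw Message-ID header value like this. The function name `mid_permalink` is hypothetical; the only thing taken from the description above is the URL form (http, not https), and stripping the angle brackets reflects how Message-ID headers are normally written.

```python
def mid_permalink(message_id: str) -> str:
    """Build a mail-archive.com permalink from a Message-ID header value."""
    # Strip the angle brackets that surround a raw Message-ID header value.
    mid = message_id.strip().lstrip("<").rstrip(">")
    # Note the scheme: http, not https, as described above.
    return "http://mid.mail-archive.com/" + mid


print(mid_permalink("<1233038409.12589.1460577425350.JavaMail.yahoo@mail.yahoo.com>"))
```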

The index on each of the mailing lists is a sliding window that only lasts for 
a few thousand messages, but *all* messages are available (even if they're not 
listed on the index pages):

- via their permalinks
- via Google search (give Google a little while to finish indexing all the new 
messages we recently uploaded to mail-archive.com)
- via the mail-archive.com web site search box

Finally, all of the old Open MPI mail archives are still available under 
https://www.open-mpi.org/community/lists/ (so that we don't break lots of old 
links from around the web), but they are frozen.  No new messages have been 
added to the frozen archive since late July 2016 or so.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
devel mailing list
de...@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


[hwloc-users] Mailing list migration: status

2016-07-27 Thread Jeff Squyres (jsquyres)
We have transitioned the Open MPI mailing lists to our new best friends at the 
New Mexico Consortium (http://newmexicoconsortium.org/).  Thank you, NMC!

For at least a little while, you'll see newmexicoconsortium.org in the footers 
of our mailing list mails.  Eventually, we hope to replace those with 
lists.open-mpi.org URLs.  We also know that Gmail users will see a red broken 
lock indicating that Open MPI mails were not encrypted in transit.  We'll 
hopefully be able to upgrade this over time.

We have also transitioned the web archives of Open MPI mailing lists to The 
Mail Archive (http://mail-archive.com):

1. New mails will start showing up there as they are sent across the lists.  
For example, the Hwloc "users" list will show up under 
https://www.mail-archive.com/hwloc-users@lists.open-mpi.org/.

2. Sometime soon, all the archives of all Open MPI lists will be loaded at 
mail-archive.com, so everything will remain Google-able.

3. The old web mail archives (under https://www.open-mpi.org/community/lists/) 
are now frozen.  They will continue to be available so that old links in code, 
comments, tickets, pull requests, etc. will continue to work.  We will 
eventually remove links to this archive from the Open MPI web site, and replace 
them with links to the new Mail Archive.  Hopefully, the old archive will 
eventually disappear from Google, and will effectively be replaced by the new 
copies of the same mails at mail-archive.com.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] This list is suspended while migrating

2016-07-27 Thread Jeff Squyres (jsquyres)
...and we're back.

NOTE: the email address for this list has now changed!  It is now 
@lists.open-mpi.org (it used to be @open-mpi.org).

PLEASE UPDATE YOUR CONTACTS AND MAIL CLIENT FILTERS!


> On Jul 27, 2016, at 12:01 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> We are beginning the list migration process; this list will be suspended 
> while it is in transit to a new home.
> 
> We can't predict the exact timing of the migration -- hopefully it'll only 
> take a few hours.
> 
> See you on the other side!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] This list is suspended while migrating

2016-07-21 Thread Jeff Squyres (jsquyres)
We unfortunately ran into major issues while trying to migrate these lists, and 
have therefore restored them back on the Indiana U servers until we try the 
migration again.

Sorry for the hassle folks; stay tuned!


> On Jul 20, 2016, at 10:28 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> We are beginning the list migration process; this list will be suspended 
> while it is in transit to a new home.
> 
> We can't predict the exact timing of the migration -- hopefully it'll only 
> take a few hours.
> 
> See you on the other side!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 


-- 
Jeff Squyres
jsquy...@cisco.com



[hwloc-users] This list is suspended while migrating

2016-07-20 Thread Jeff Squyres (jsquyres)
We are beginning the list migration process; this list will be suspended while 
it is in transit to a new home.

We can't predict the exact timing of the migration -- hopefully it'll only take 
a few hours.

See you on the other side!

-- 
Jeff Squyres
jsquy...@cisco.com



[hwloc-users] This list is migrating!

2016-07-19 Thread Jeff Squyres (jsquyres)
Short version
=============

The server for this mailing list will be migrating sometime soon (the exact 
timing is not fully predictable).  Three things you need to know:

1. We'll send a "This list is now closed for migration" last message when the 
migration starts
2. We'll send a "This list is now open again" first message when the migration 
completes
3. The list email address will move from @open-mpi.org to @lists.open-mpi.org

More detail
===========

The Open MPI hosting infrastructure is slowly moving away from its home of 10+ 
years: our gracious hosts at Indiana University (thank you for all the help and 
support, IU!).  The next pieces to migrate are the Open MPI project mailing 
lists (including this one).

The exact timing of the migration depends on our new hosting provider vendor; 
it's quite difficult to give an exact timeline.  The procedure will generally 
be something like this:

1. Send the final "This list is now closed!" email across this list
2. Shut off all incoming mail to the list
3. Shut down the web pages that allow users to make changes to the list
4. Bundle up all the list data and send it to our new hosting provider
5. Work with the provider to get the new lists online
6. Send a "This list is now open again!" email across the list

As noted above, we're changing the hostnames on the mailing lists to 
@lists.open-mpi.org so that we can de-couple the mailing lists from the rest of 
the web hosting infrastructure.  Please update your addressbook and mail 
filters appropriately.

Webified archives of the mailing lists will continue to be available:

1. Once the migration completes, the existing web archives (under, for example, 
https://www.open-mpi.org/community/lists/users/) will continue to be available, 
but they'll be frozen -- no new messages will be added there.  Specifically: 
links to old posts will continue to work.
2. New web archives for all the lists -- to include all the old posts -- will 
become available elsewhere.  Specifics will be included in the "The list is now 
open again!" mail.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] hello world can't run in Ubuntu 12.04

2015-04-15 Thread Jeff Squyres (jsquyres)
+1

Please try upgrading to Open MPI v1.8.x and see if that solves your problem.


> On Apr 15, 2015, at 12:06 AM, Christopher Samuel  
> wrote:
> 
> On 15/04/15 12:19, Li Li wrote:
> 
>>    I have installed openmpi 1.5 and tested it with a simple program
> 
> Umm, Open-MPI 1.5 is ancient!
> 
> Open-MPI 1.8.x is the current stable release branch, 1.6 was the
> previous stable release branch (we're still on that here).
> 
> 1.5 was the old feature branch that led up to the 1.6 stable series.
> 
> All the best,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2015/04/1163.php


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-13 Thread Jeff Squyres (jsquyres)
Check out, in particular, the section "Single-task and multi-task modes".




On Dec 11, 2014, at 11:06 PM, Jeff Hammond  wrote:

> Can someone post docs for this resource halving Squyres claims? I've never 
> heard of this.
>
> Jeff
>
> On Thursday, December 11, 2014, Samuel Thibault  
> wrote:
> Jeff Squyres (jsquyres), on Thu 11 Dec 2014 21:12:27 +, wrote:
> > When the BIOS is set to enable hyper threading, then several resources on 
> > the core are split when the machine is booted up (e.g., some of the queue 
> > depths for various processing units in the core are half the length that 
> > they are when hyperthreading is disabled in the BIOS).
>
> Perhaps some queues get divided, but most of the resources (such as
> cache, TLB, etc.) are completely available when using only one
> hyperthread, like they would be with HT disabled.
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1132.php
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1134.php


--
Jeff Squyres
jsquy...@cisco.com



Attachment: Hyperthreading.pdf


Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Jeff Squyres (jsquyres)
On Dec 11, 2014, at 2:03 PM, Brice Goglin  wrote:

> By the way, if you can't in the BIOS, you may want to disable the
> hyperthread in the kernel:
> 
> for i in $(hwloc-calc --whole-system --po -I pu core:all.pu:0) ; do echo 0 > 
> /sys/devices/system/cpu/cpu$i/online ; done
> 
> (write 1 instead of 0 to reenable them).

But keep in mind that this is the semantic equivalent of using hwloc-bind to 
bind to the first HT in each core.

I.e., disabling HT in the Linux kernel just disables scheduling on the 2nd HT.
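
To make "the 2nd HT" concrete, here is a hedged Python sketch that works out which logical CPUs are secondary hyperthreads, given the sibling lists that Linux exposes in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. It is an alternative to the hwloc-calc one-liner quoted above, not a replacement for it; it only computes the list and does not write to sysfs. The function names and the example machine are made up for illustration.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,4" or "0-1,8-9" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def secondary_hyperthreads(sibling_lists):
    """Given {cpu: siblings string}, return the CPUs that are not the first thread of their core."""
    secondary = set()
    for text in sibling_lists.values():
        siblings = parse_cpu_list(text)
        # Keep the lowest-numbered sibling; every other sibling is a secondary thread.
        secondary.update(siblings[1:])
    return sorted(secondary)


# Example values as they might appear in thread_siblings_list (made-up 4-core machine).
example = {0: "0,4", 1: "1,5", 2: "2,6", 3: "3,7",
           4: "0,4", 5: "1,5", 6: "2,6", 7: "3,7"}
print(secondary_hyperthreads(example))
```

Either way, the hardware is still booted with HT enabled; only the scheduling changes.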

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Jeff Squyres (jsquyres)
On Dec 11, 2014, at 1:36 PM, Brock Palen  wrote:

> Ok let me expand then.  I don't have control over the bios.
> 
> The testing I am doing resides on a cloud provider and from our testing it 
> appears that it has HT enabled.  It is ambiguous though to me what I see vs 
> how they allocate on their hypervisor. 

Oh, if you're in a hypervisor, then what you're seeing has zero correlation to 
reality.

If it's an HPC cloud provider, they *likely* paired cores in the hypervisor 
with real/physical cores.  More specifically: they *probably* paired hyper 
threads in the hypervisor with real/physical hyper threads (i.e., so that the 
lstopo in the hypervisor is equivalent to lstopo outside the hypervisor).

But you'll need to ask them, because modern VMs let you do whatever you want in 
terms of mapping VM cores/HTs to physical cores/HTs.

Consider: you can run dozens of web server VMs on a machine with 10 cores.  
Each VM will say that it has, say, 1 or 2 cores.  But clearly, the sum of the 
core counts across the VMs is larger than the total number of physical cores.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Jeff Squyres (jsquyres)
I'm not sure you're asking a well-formed question.

When the BIOS is set to enable hyper threading, then several resources on the 
core are split when the machine is booted up (e.g., some of the queue depths 
for various processing units in the core are half the length that they are when 
hyperthreading is disabled in the BIOS).

Hence, running a process on a core that only uses a single hyperthread (when HT 
is enabled) is not quite the same thing as booting up with HT disabled and 
running that same job on the core.

Make sense?

Meaning: if you want to test HT vs. non-HT performance, you really need to 
change the BIOS settings and reboot, sorry.

Also, note that if you have HT enabled and you run a single-threaded app bound 
to a core, it will only use 1 of those HTs -- the other HT will be largely 
dormant. Meaning: don't expect that running a single-threaded app on a core 
that has HT enabled will magically take advantage of some performance benefit 
of aggressive automatic parallelization.  You really need multiple threads in a 
process to get performance advantages out of HT.



On Dec 11, 2014, at 12:51 PM, Brock Palen  wrote:

> When a system has HT enabled is one core presented the real one and one the 
> fake partner?  Or is that not the case?
> 
> If wanting to test behavior without messing with the bios how do I select 
> just the 'real cores'  if this is the case?   
> 
> I am looking for the equivalent of 
> 
> hwloc-bind ALLREALCORES  my.exe
> 
> Doing some performance study type things.
> 
> Thanks,
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1126.php


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] Using hwloc to detect Hard Disks

2014-09-22 Thread Jeff Squyres (jsquyres)
On Aug 28, 2014, at 7:27 PM, Samuel Thibault  wrote:

>> I am not able to figure out how to read Hard drive details, for e.g.,
>> the content provided by hdparm application.
>> 
>> My first question is, is it possible to read this using hwloc? If yes, can
>> anyone direct me to the documentation which describes how to use it?
> 
> Well, hwloc's goal is to describe the hardware _locality_, not its
> precise content.  So we don't provide that level of detail, we only
> provide where the pieces of hardware reside.

Can you be a bit more specific about what information you want to query?

I ask because it strikes me that hwloc does gather some kinds of hardware 
information and put them as attributes on existing hwloc topology objects.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] CPU info on ARM

2014-01-28 Thread Jeff Squyres (jsquyres)
I passed this on to my OMPI ARM contact (Leif Lindholm).  Here's what he said:

   "It gets a bit trickier on ARM... since we may also have (implementation
time) configurable cache sizes and also big.LITTLE (different processor
models executing in the same SMP system)."

He passed the question on to another ARM guy, asking for further detail.  I'll 
pass on what he says.



On Jan 28, 2014, at 3:39 AM, Brice Goglin  wrote:

> Hello,
> 
> Is anybody familiar with ARM CPUs?
> 
> I am adding more CPU information because Intel needs more:
> CPUVendor=GenuineIntel
> CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> CPUModelNumber=45
> CPUFamilyNumber=6
> 
> Would something similar be useful for ARM? What are the fields below
> from /proc/cpuinfo on ARM that would be useful to developers?
> Processor: Marvell PJ4Bv7 Processor rev 1 (v7l)
> BogoMIPS: 1196.85
> Features: swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls
> CPU implementer: 0x56
> CPU architecture: 7
> CPU variant: 0x1
> CPU part: 0x581
> CPU revision: 1
> Hardware: Marvell Armada-370
> Revision: 
> Serial: 
> 
> thanks
> Brice
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


-- 
Jeff Squyres
jsquy...@cisco.com



[hwloc-users] hwloc problem on SGI machine

2014-01-10 Thread Jeff Squyres (jsquyres)
Jeff Becker (CC'ed) reported to me a failure with hwloc 1.7.2 (in OMPI trunk).  
I had him verify this with a standalone hwloc 1.7.2, and then had him try 
standalone hwloc 1.8 as well -- all got the same failure.

Here's what he's seeing in 1.7.2:

$ lstopo
Different OS indexes
lstopo: topology-linux.c:2731: look_sysfsnode: Assertion `node == res_obj' 
failed.
Aborted (core dumped)

In 1.8, the issue is the same, but a different line number (2741).

It's an SGI x86_64 server, running SLES 11.

Is this an hwloc issue, or a hardware issue?

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [hwloc-users] [WARNING: A/V UNSCANNABLE]Re: [OMPI users] SIGSEGV in opal_hwlock152_hwlock_bitmap_or.A // Bug in 'hwlock" ?

2013-11-04 Thread Jeff Squyres (jsquyres)
You should be able to grab an Open MPI 1.7.x nightly tarball, and it should 
have the newer hwloc that fixes this issue.

Can you give it a whirl and see if it works for you?


On Nov 4, 2013, at 1:49 PM, Brice Goglin  wrote:

> Thanks. That's indeed the same bug that you got in Open MPI (reuse of a
> hwloc cpuset structure that was freed earlier). It's a nasty bug that
> happens when reloading from XML on big machines like yours (that
> explains why lstopo works while xmlbuffer and OMPI fail). It was fixed
> in hwloc v1.7.1 (hence will be fixed in Open MPI 1.7.4 from what I
> understand) but the fix was too big to be backported to older hwloc/OMPI.
> 
> You should be able to work around the problem for now by setting
> HWLOC_GROUPING=0 in your environment.
> 
> I re-added hwloc-users to CC so that the bug is officially "closed".
> 
> Brice
> 
> 
> 
> 
> On 04/11/2013 22:33, Paul Kapinos wrote:
>> Hello again,
>> I'm not allowed to publish to Hardware locality user list so I omit it
>> now.
>> 
>> On 11/04/13 14:19, Brice Goglin wrote:
>>> On 04/11/2013 11:44, Paul Kapinos wrote:
>>>> Hello all,
>>>> I.
>>>> sorry for this paleontologic excursion. (The 4 years old 'lstopo'
>>>> binary was just in my private bin folder and still being runnable..)
>>>> 
>>>> Attached output of newer version 1.5 (Linux-Default one on RHEL/6.4
>>>> (SL/6.4).
>>>> 
>>>> II.
>>>> I've also tested hwloc-1.5.2 (could not find v.1.5.3) and hwloc-1.7.2
>>>> as Brice suggested, by 'configure' + 'make test' - logs attached.
>>>> 
>>>> 1.5.2 fails:
>>>>> /bin/sh: line 5: 20677 Segmentation fault (core dumped) ${dir}$tst
>>>>> FAIL: xmlbuffer
>>> 
>>> Can you give more details about this segfault?
>>> 
>>> Try (from the build tree):
>>> $ libtool --mode=execute gdb xmlbuffer
>>> then type 'run'
>>> when it crashes, type 'bt full' and send the output.
>> 
>> see attached file trace_1.5.2.txt
>> 
>> 
>> 
>> 
>> 
>>> 
>>> Then please also run from hwloc 1.5.2:
>>> * "lstopo foo.xml" and send "foo.xml"
>>> * "hwloc-gather-topology foo" and send "foo.tar.bz2"
>> 
>> also attached but with non-empty names :o)
>> 
>> 
>> 
>> Best
>> 
>> Paul
>>> 
>>>> whereby 1.7.2 seem to be OK.
>>>> 
>>>> AFAIK in OpenMPI 1.7.4 the version of 'hwlock' has to be updated?
>>>> If so, the original issue should be fixed by this, huh?
>>> 
>>> Hard to say before we get details about the crash in xmlbuffer above.
>>> 
>>> Brice
>>> 
>>> 
>>>> 
>>>> Many thanks for your help!
>>>> Best
>>>> 
>>>> Paul
>>>> 
>>>> pk224850@linuxitvc00:~/SVN/mpifasttest/trunk[511]lstopo 1.5
>>>> $ lstopo lstopo_linuxitvc00_1.5.txt
>>>> $ lstopo lstopo_linuxitvc00_1.5.xml
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 11/01/13 15:37, Brice Goglin wrote:
>>>>> Sorry, I missed the mail on OMPI-users.
>>>>> 
>>>>> This hwloc looks very old. We haven't had Misc objects instead of
>>>>> Groups since we switched from 0.9 to 1.0. You should regenerate the
>>>>> XML file with a hwloc version that came out after the big bang (or
>>>>> better, after the asteroid killed the dinosaurs). Please resend that
>>>>> XML from a recent hwloc so that we can get a better clue of the
>>>>> problem.
>>>>> 
>>>>> Assuming there's a bug in OMPI's hwloc, I would suggest downloading
>>>>> hwloc 1.5.3 and running make check on that machine. And try again
>>>>> with hwloc 1.7.2 in case that's already fixed.
>>>>> 
>>>>> thanks
>>>>> Brice
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 01/11/2013 15:24, Jeff Squyres (jsquyres) wrote:
>>>>>> Paul Kapinos originally reported this issue on the OMPI users list.
>>>>>> 
>>>>>> 

Re: [hwloc-users] [OMPI users] SIGSEGV in opal_hwlock152_hwlock_bitmap_or.A // Bug in 'hwlock" ?

2013-11-01 Thread Jeff Squyres (jsquyres)
Paul Kapinos originally reported this issue on the OMPI users list.

He is showing a stack trace from OMPI-1.7.3, which uses hwloc 1.5.2 (note that 
OMPI 1.7.4 will use hwloc 1.7.2).

I tried to read the xml file he provided with the git hwloc master HEAD, and it 
fails:

-
❯❯❯ ./utils/lstopo -i lstopo_linuxitvc00.xml
ignoring depth attribute for object type without depth
ignoring depth attribute for object type without depth
XML component discovery failed.
hwloc_topology_load() failed (Invalid argument).
-

Any idea what's happening here?

BTW, I can apply the fix to both the OMPI SVN trunk and v1.7 branch (since OMPI 
v1.7 is now up to hwloc 1.7.2).



On Oct 31, 2013, at 1:28 PM, Paul Kapinos  wrote:

> Hello all,
>
> using 1.7.x (1.7.2 and 1.7.3 tested), we get SIGSEGV from somewhere in-deepth 
> of 'hwlock' library - see the attached screenshot.
>
> Because the error is strongly aligned to just one single node, which in turn 
> is kinda special one (see output of 'lstopo -'), it smells like an error in 
> the 'hwlock' library.
>
> Is there a way to disable hwlock or to debug it in somehow way?
> (besides to build a debug version of hwlock and OpenMPI)
>
> Best
>
> Paul
>
>
>
>
>
>
>
> --
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> _______
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
jsquy...@cisco.com

[hwloc-users] Migrating www.open-mpi.org

2013-08-05 Thread Jeff Squyres (jsquyres)
All --

Our hosting provider will be migrating www.open-mpi.org to a new machine on 
Wednesday.  See the message below for details.


Begin forwarded message:

From: DongInn Kim <di...@cs.indiana.edu>
Subject: Migrating www.open-mpi.org from milliways to lion
Date: August 5, 2013 11:53:38 AM PDT

Dear Open MPI developers and users,

We are planning to move all the services under www.open-mpi.org to the new 
server on Wednesday, Aug 7th, 2013.
This migration may need some outage time of web services (e.g., 
http://www.open-mpi.org) and mailing list services (e.g., us...@open-mpi.org, 
de...@open-mpi.org, …).

The migration schedule is as follows:
- Date: Wednesday, Aug 7th, 2013
- Time:
6:00am-8:00am Pacific US time
7:00am-9:00am Mountain US time
8:00am-10:00am Central US time
9:00am-11:00am Eastern US time
1:00pm-3:00pm GMT

The following services would not be available during the migration.

- Web services (e.g., www.open-mpi.org)
- mailing lists:
  ad...@open-mpi.org
  announce
  bugs
  devel
  devel-core
  docs
  ft
  hwloc-announce
  hwloc-bugs
  hwloc-devel
  hwloc-svn
  hwloc-users
  llamas
  mtt-announce
  mtt-bugs
  mtt-devel
  mtt-devel-core
  mtt-results
  mtt-svn
  mtt-users
  ompi-user-docs-bugs
  ompi-user-docs-svn
  svn
  svn-docs
  svn-docs-full
  svn-full
  svn-private
  svn-private-full
  users
- Mail archives
  http://www.open-mpi.org/community/lists/
- Mercurial mirror
  Will disappear (it has long-since moved out to Bitbucket)

I hope that we will not lose any mail sent to the above mailing lists during 
the migration, but it would be really appreciated if you could hold off on 
sending emails and making svn commits until the migration is done.

Please let me know if you have any questions or issues about this migration.

Regards,

--
- DongInn
---
CREST System administrator
Indiana University
Bloomington, IN


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] Many queries creating slow performance

2013-03-05 Thread Jeff Squyres (jsquyres)
FWIW, we do this in Open MPI: one process on each server does the lstopo (via C 
API, of course). That information is then exported to all other processes via 
XML, so that only 1 process per server walks the /sys trees, etc.


On Mar 5, 2013, at 3:25 PM, Brice Goglin  wrote:

> Hello Simon,
> 
> I don't think anybody ever benchmarked this, but people have been 
> complaining about this problem appearing on large machines at some point. I 
> have a large SGI machine at work, I'll see if I can reproduce this.
> 
> One solution is to export the topology to XML once and then have all your MPI 
> processes read from XML. Basically, do "lstopo /tmp/foo.xml" and then export 
> HWLOC_XMLFILE=/tmp/foo.xml in the environment before starting your MPI job.
> 
> If the topology doesn't change (and that's likely the case), the XML file 
> could even be stored by the administrator in a "standard" location (not in 
> /tmp)
> 
> Brice
> 
> 
> 
> Le 05/03/2013 20:23, Simon Hammond a écrit :
>> Hi HWLOC users,
>> 
>> We are seeing some significant performance problems using HWLOC 1.6.2 on 
>> Intel's MIC products. In one of our configurations we create 56 MPI ranks, 
>> each rank then queries the topology of the MIC card before creating threads. 
>> We are noticing that if we run 56 MPI ranks as opposed to one the calls to 
>> query the topology in HWLOC are very slow, runtime goes from seconds to 
>> minutes (and upwards).
>> 
>> We guessed that this might be caused by the kernel serializing access to the 
>> /proc filesystem but this is just a hunch. 
>> 
>> Has anyone had this problem and found an easy way to change the library / 
>> calls to HWLOC so that the slow down is not experienced? Would you describe 
>> this as a bug?
>> 
>> Thanks for your help.
>> 
>> 
>> --
>> Simon Hammond
>> 
>> 1-(505)-845-7897 / MS-1319
>> Scalable Computer Architectures
>> Sandia National Laboratories, NM
>> 
>> 
>> 
>> 
>> 
>> 
>> ___
>> hwloc-users mailing list
>> 
>> hwloc-us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-13 Thread Jeff Squyres
On Sep 13, 2012, at 11:17 AM, Brice Goglin wrote:

> I think I am going to agree. Three comments:
> * which "binding fails" do you refer to? I assume all cases I listed.

Yes.

> * I was initially against changing the default behavior of hwloc-bind,
> but it's not like changing the ABI. There are likely very few scripts
> using hwloc-bind out there. Breaking some of them is not too bad as long
> as we give a useful error message.

Agreed.

> * If we start failing because of invalid inputs in hwloc-bind, we may
> have to do the same in hwloc-calc. The parsing code is shared anyway.

I think the philosophy should apply to all executables, actually.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-13 Thread Jeff Squyres
These are all good points.

That being said, Brock Palen made another good point on the OMPI list recently. 
It was in regard to OpenFabrics registered memory, but the issue is quite 
analogous.

OMPI used to issue a warning if there wasn't enough registered memory 
available, but allow the job to run anyway (at lower performance).  Brock was 
firmly opposed to that (he's an HPC sysadmin): he didn't want jobs to run at 
all if there wasn't enough registered memory.  

One rationale here is that users won't tend to notice a warning at the 
top of a job's stdout/stderr -- if the job ran, that's good enough (until much 
later when they realize that they're not getting the right performance, or, 
worse, this job is impacting other jobs because its affinity is wrong).  But if 
the job doesn't run, that will get noticed immediately, and the problem will be 
fixed by a human.

Hence, it seems safer to fall back on the "if we can't give the user what they 
asked for, fail and let a human figure it out" philosophy.  Even if it means 
changing the default.  Keep in mind that if they run hwloc-bind, they're 
specifically asking for binding.

I think I'm now 80/20 in the "abort hwloc-bind if it fails to bind" camp.  :-)

After a little more thought, I'm also thinking that having an "it's ok if 
binding fails" CLI flag is a bad idea.  If the user really wants something to 
run without binding, then they can just do that in the shell:

-
hwloc-bind ...whatever... my_executable
if test "$?" != "0"; then
    # run without binding
    my_executable
fi
-

My $0.02.  :)


On Sep 13, 2012, at 4:09 AM, Brice Goglin wrote:

> (resending because the formatting was bad)
> 
> 
> Le 13/09/2012 00:26, Jeff Squyres a écrit :
>> On Sep 12, 2012, at 10:30 AM, Samuel Thibault wrote:
>> 
>>>> Sidenote: if hwloc-bind fails to bind, should we still launch the child 
>>>> process?
>>> Well, it's up to you to decide :)
>> 
>> Anyone have an opinion?  I'm 60/40 in favor of not letting it run, under the 
>> rationale that the user asked for something that we can't deliver, so we 
>> shouldn't continue.
>> 
>> Any idea what numactl does if it can't bind?
> 
> Let me add taskset to the list of tools to compare to, and distinguish
> several cases:
> 
> 1) invalid command line
> * taskset (with invalid list "2,") errors out
> * numactl (with invalid list "2,") errors out
> * hwloc-bind (with invalid location followed by "-- executable") errors
> out (considers the invalid location as the executable name)
> 
> 2) valid command-line containing *only* non-existing objects:
> * taskset errors out
> * numactl errors out
> * hwloc-bind succeeds, binds to nothing
> 
> 3) valid command-line containing some existing objects and some
> non-existing:
> * taskset succeeds (ignores non-existing objects, binds to others)
> * numactl errors out
> * hwloc-bind succeeds (ignores non-existing objects, binds to others)
> 
> 4) valid command-line with only valid objects but missing OS support
> * doesn't apply to taskset and numactl afaik
> * hwloc-bind succeeds (ignores failure to bind)
> 
> 
> We have a --strict option, which translates into the STRICT binding flag,
> which is documented as
>  "Request strict binding from the OS.  The function will fail if the
> binding can not be guaranteed / completely enforced."
> I usually see "non-strict" as "if you can't do what I want, do something
> similar". It wouldn't be too bad to say that this applies to (3) (bind to
> a smaller set than requested).
> 
> But (2) and (4) are different. Not binding at all or binding to nothing
> is far from "non-strict". But I wonder if adding a new command-line flag
> to exit on such errors would be confusing with respect to the existing
> --strict.
> 
> We could also change the default to exit on error, and add --force to
> launch the process even on failure to bind. But changing defaults isn't
> always a good idea.
> 
> Brice
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-12 Thread Jeff Squyres
On Sep 12, 2012, at 6:44 PM, Samuel Thibault wrote:

>> Anyone have an opinion?  I'm 60/40 in favor of not letting it run, under the 
>> rationale that the user asked for something that we can't deliver, so we 
>> shouldn't continue.
> 
> Well, it depends on the situation. The binding might only be an
> optimization, and failing just because of that is not nice. When it's an
> administration decision, it's different, but then one would use cgroups
> & such instead.


How about adding a flag to make it fail if it doesn't bind?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-12 Thread Jeff Squyres
On Sep 12, 2012, at 6:42 PM, Samuel Thibault wrote:

> No, we have it, but not all solaris systems have it.


Ah, I see.  So if Siegmar had done "hwloc-bind socket:0 ..." -- assuming his 
system has lgrp support -- that should work, right?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-12 Thread Jeff Squyres
On Sep 12, 2012, at 10:30 AM, Samuel Thibault wrote:

>> Sidenote: if hwloc-bind fails to bind, should we still launch the child 
>> process?
> 
> Well, it's up to you to decide :)


Anyone have an opinion?  I'm 60/40 in favor of not letting it run, under the 
rationale that the user asked for something that we can't deliver, so we 
shouldn't continue.

Any idea what numactl does if it can't bind?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Solaris and hwloc

2012-09-12 Thread Jeff Squyres
On Sep 12, 2012, at 10:28 AM, Samuel Thibault wrote:

>> He seems to get an hwloc error any time he tries to bind to more than 1 PU.  
>> Is that expected on Solaris?
> 
> Without lgrp support, unfortunately yes: the processor_bind solaris interface 
> only permits binding to one processor.
> 
> With lgrp support, one should be able to bind oneself to sets of whole
> NUMA nodes. I don't know of any interface which would provide a granularity
> between one processor and one NUMA node.


Ah.  So -- for Open MPI on Solaris using hwloc, all we can do is bind to 1 PU 
at a time.  I suppose we should release-note that...

(Sorry Siegmar! :-( )

And just so I understand -- we don't have lgrp support in hwloc, mainly because 
no one had the cycles/interest to implement it.  Is that correct?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] Solaris and hwloc

2012-09-12 Thread Jeff Squyres
)
> MCW rank  (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 
> Socket:1024.Core:0.PU:1 
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4 
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7 
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10 
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13 
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
> 
> 
> rs0 hwloc 112 hwloc-bind pu:0 -l report-bindings.sh
> MCW rank  (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 
> Socket:1024.Core:0.PU:1 
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4 
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7 
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10 
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13 
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
> 
> rs0 hwloc 113 hwloc-bind pu:0 -p report-bindings.sh
> MCW rank  (rs0.informatik.hs-fulda.de): Socket:1024.Core:0.PU:0 
> Socket:1024.Core:0.PU:1 
> Socket:1024.Core:2.PU:2 Socket:1024.Core:2.PU:3 Socket:1024.Core:4.PU:4 
> Socket:1024.Core:4.PU:5 Socket:1024.Core:6.PU:6 Socket:1024.Core:6.PU:7 
> Socket:1032.Core:8.PU:8 Socket:1032.Core:8.PU:9 Socket:1032.Core:10.PU:10 
> Socket:1032.Core:10.PU:11 Socket:1032.Core:12.PU:12 Socket:1032.Core:12.PU:13 
> Socket:1032.Core:14.PU:14 Socket:1032.Core:14.PU:15
> 
> Is the above output helpful? Thank you very much for your help in advance.
> Do you know a C++ application which I can try to test our compiler?
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> Prof. Dr. Siegmar Gross
> Hochschule Fulda, University of Applied Sciences
> FB Angewandte Informatik, Department of Applied Computer Science
> Marquardstr. 35, D-36039 Fulda
> Tel.:   +49 (0)661 9640 - 333
> Fax:    +49 (0)661 9640 - 349
> WWW:    http://www.hs-fulda.de/~gross
> E-Mail: siegmar.gr...@informatik.hs-fulda.de
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Thread binding problem

2012-09-05 Thread Jeff Squyres
On Sep 5, 2012, at 2:36 PM, Gabriele Fatigati wrote:

> I don't think it is simply out of memory, since the NUMA node has 48 GB and 
> I'm allocating just 8 GB.

Mmm.  Probably right.

Have you run your application through valgrind or another memory-checking 
debugger?

I've seen cases of heap corruption lead to malloc incorrectly failing with 
ENOMEM.
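
To tell the cases in this thread apart when hwloc_set_area_membind_nodeset() 
returns -1, something like the sketch below may help.  ENOSYS and EXDEV are the 
two values the hwloc documentation lists (as quoted earlier in the thread); 
ENOMEM is what "Cannot allocate memory" corresponds to.  The helper name 
describe_membind_errno is hypothetical and not part of hwloc:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper (NOT an hwloc function): map the errno value left
 * behind by a failed memory-binding call to a short explanation.
 * ENOSYS and EXDEV are the documented hwloc cases; ENOMEM is the
 * undocumented one reported in this thread. */
static const char *describe_membind_errno(int err)
{
    switch (err) {
    case ENOSYS:
        return "action not supported";
    case EXDEV:
        return "binding cannot be enforced";
    case ENOMEM:
        return "out of memory (possibly an internal allocation failed)";
    default:
        /* Anything else: fall back to the system's own message. */
        return strerror(err);
    }
}
```

Pass it the errno value saved immediately after the failing call, before any 
other library call has a chance to overwrite it.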

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Thread binding problem

2012-09-05 Thread Jeff Squyres
Perhaps you simply have run out of memory on that NUMA node, and therefore the 
malloc failed.  Check "numactl --hardware", for example.

You might want to check the output of numastat to see if one or more of your 
NUMA nodes have run out of memory. 


On Sep 5, 2012, at 12:58 PM, Gabriele Fatigati wrote:

> I've reproduced the problem in a small MPI + OpenMP code.
> 
> The error is the same: after some memory binds, it gives "Cannot allocate memory".
> 
> Thanks.
> 
> 2012/9/5 Gabriele Fatigati 
> When I scale down the matrix size, binding works well, but the available 
> memory is enough even for a bigger matrix, so I'm a bit confused.
> 
> Using the same big matrix size without binding, the code works well, so how 
> can I explain this behaviour?
> 
> Maybe hwloc_set_area_membind_nodeset introduces extra allocations that 
> persist after the call?
> 
> 
> 
> 2012/9/5 Brice Goglin 
> An internal malloc failed then. That would explain why your malloc failed too.
> It looks like you malloc'ed too much memory in your program?
> 
> Brice
> 
> 
> 
> 
> Le 05/09/2012 15:56, Gabriele Fatigati a écrit :
>> An update:
>> 
>> placing strerror(errno) after hwloc_set_area_membind_nodeset  gives: "Cannot 
>> allocate memory"
>> 
>> 2012/9/5 Gabriele Fatigati 
>> Hi,
>> 
>> I've noticed that hwloc_set_area_membind_nodeset returns -1 but errno is not 
>> equal to EXDEV or ENOSYS. I assumed that these two cases were the only 
>> possibilities.
>> 
>> From the hwloc documentation:
>> 
>> -1 with errno set to ENOSYS if the action is not supported
>> -1 with errno set to EXDEV if the binding cannot be enforced
>> 
>> 
>> Any other binding failure reason? The available memory is enough.
>> 
>> 2012/9/5 Brice Goglin 
>> Hello Gabriele,
>> 
>> The only limit that I would think of is the available physical memory on 
>> each NUMA node (numactl -H will tell you how much of each NUMA node memory 
>> is still available).
>> malloc usually only fails (it returns NULL?) when there is no *virtual* 
>> memory left, that's different. If you don't allocate tons of terabytes of 
>> virtual memory, this shouldn't happen easily.
>> 
>> Brice
>> 
>> 
>> 
>> 
>> Le 05/09/2012 14:27, Gabriele Fatigati a écrit :
>>> Dear Hwloc users and developers,
>>> 
>>> 
>>> I'm using hwloc 1.4.1 in a multithreaded program on a Linux platform, where 
>>> each thread binds many non-contiguous pieces of a big matrix, making very 
>>> intensive use of the hwloc_set_area_membind_nodeset function:
>>> 
>>> hwloc_set_area_membind_nodeset(topology, punt+offset, len, nodeset, 
>>> HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_THREAD | HWLOC_MEMBIND_MIGRATE);
>>> 
>>> Binding seems to work well, since the return code from the function is 0 
>>> for every call.
>>> 
>>> The problem is that after binding, a simple small new malloc fails, 
>>> without any apparent reason.
>>> 
>>> With memory binding disabled, the allocations work well.  Is there any known 
>>> problem if hwloc_set_area_membind_nodeset is used intensively?
>>> 
>>> Is there some operating system limit for memory page binding?
>>> 
>>> Thanks in advance.
>>> 
>>> -- 
>>> Ing. Gabriele Fatigati
>>> 
>>> HPC specialist
>>> 
>>> SuperComputing Applications and Innovation Department
>>> 
>>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>> 
>>> www.cineca.it   Tel: +39 051 6171722
>>> 
>>> g.fatigati [AT] cineca.it   
>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> Ing. Gabriele Fatigati
>> 
>> HPC specialist
>> 
>> SuperComputing Applications and Innovation Department
>> 
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> 
>> www.cineca.it   Tel: +39 051 6171722
>> 
>> g.fatigati [AT] cineca.it   
>> 
>> 
>> 
>> -- 
>> Ing. Gabriele Fatigati
>> 
>> HPC specialist
>> 
>> SuperComputing Applications and Innovation Department
>> 
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> 
>> ww

Re: [hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Jeff Squyres
Shame on them for using a hwloc_* prefix for their functions.  :-)

On Aug 10, 2012, at 5:24 PM, Brock Palen wrote:

> Yep very odd,
> 
> Looks like torque wrote a wrapper then for some hwloc functions.
> 
> BTW working with cgroups/cpusets in our resource manager  hwloc-info --pid  
> is _wonderful_  
> 
> I think I am good to go.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Aug 10, 2012, at 5:14 PM, Jeff Squyres wrote:
> 
>> I don't know why Google is pointing you there...
>> 
>> I went back as far as hwloc 1.3 and I cannot find a function named 
>> hwloc_bitmap_displaylist() -- that's probably why you can't find any 
>> reference to it in the docs.  :-)
>> 
>> 
>> 
>> On Aug 10, 2012, at 4:55 PM, Brock Palen wrote:
>> 
>>> Google is giving me this url:
>>> www.open-mpi.org/projects/hwloc//doc/v1.5/a2.php
>>> 
>>> When I searched for hwloc_bitmap_displaylist()   (for which I can find 
>>> nothing, not even a manpage :-) )
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Aug 10, 2012, at 4:26 PM, Jeff Squyres wrote:
>>> 
>>>> Try looking here:
>>>> 
>>>> http://www.open-mpi.org/projects/hwloc/doc/
>>>> 
>>>> You have an extra "projects" in your URL.  How did you get to that URL?  
>>>> Do we have a bug in our web pages somewhere?
>>>> 
>>>> 
>>>> On Aug 10, 2012, at 3:56 PM, Brock Palen wrote:
>>>> 
>>>>> http://www.open-mpi.org/projects/projects/hwloc/doc/
>>>>> 
>>>>> Oh noooss!!!
>>>>> 
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Jeff Squyres
I don't know why Google is pointing you there...

I went back as far as hwloc 1.3 and I cannot find a function named 
hwloc_bitmap_displaylist() -- that's probably why you can't find any reference 
to it in the docs.  :-)



On Aug 10, 2012, at 4:55 PM, Brock Palen wrote:

> Google is giving me this url:
> www.open-mpi.org/projects/hwloc//doc/v1.5/a2.php
> 
> When I searched for hwloc_bitmap_displaylist()   (for which I can find 
> nothing, not even a manpage :-) )
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Aug 10, 2012, at 4:26 PM, Jeff Squyres wrote:
> 
>> Try looking here:
>> 
>> http://www.open-mpi.org/projects/hwloc/doc/
>> 
>> You have an extra "projects" in your URL.  How did you get to that URL?  Do 
>> we have a bug in our web pages somewhere?
>> 
>> 
>> On Aug 10, 2012, at 3:56 PM, Brock Palen wrote:
>> 
>>> http://www.open-mpi.org/projects/projects/hwloc/doc/
>>> 
>>> Oh noooss!!!
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] HWLoc Documentation pages 404's

2012-08-10 Thread Jeff Squyres
Try looking here:

  http://www.open-mpi.org/projects/hwloc/doc/

You have an extra "projects" in your URL.  How did you get to that URL?  Do we 
have a bug in our web pages somewhere?


On Aug 10, 2012, at 3:56 PM, Brock Palen wrote:

> http://www.open-mpi.org/projects/projects/hwloc/doc/
> 
> Oh noooss!!!
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Hwloc error.

2012-05-30 Thread Jeff Squyres
On May 30, 2012, at 11:22 AM, Samuel Thibault wrote:

> i.e. the kernel reports that socket 0 is completely in node 1, while
> socket 1 is half in node 1 and half in node 2. Do you have more
> information about what the machine actually contains socket- and
> NUMA-wise? The dell website is not really helpful, it talks about 4-16
> cores for the DL165 G7, while you have 24.


How old is your Dell BIOS firmware?  You might need to update it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Understanding hwloc-ps output

2012-05-30 Thread Jeff Squyres
 string): 
   ompi_bound: socket 0[core 1] 
  current_binding: socket 0[core 1] 
   exists: socket 0 has 4 cores, socket 1 has 4 cores
rank 1 (layout): 
   ompi_bound: [. B . .][. . . .]
  current_binding: [. B . .][. . . .]
   exists: [. . . .][. . . .]
%
-

Note, too, that OMPI 1.6 only lets you bind to sockets and cores, which is why 
the above output doesn't show hyperthreads (even though they are there, 
according to the lstopo output).  

That being said, we have completely revamped process/processor affinity support 
in what will become OMPI v1.7 (i.e., the current OMPI SVN trunk).  For example, 
OMPI 1.7 will let you bind to hyperthreads (and caches and ...others).  If you 
run the same example OMPI_Affinity_str() program with what will become OMPI 
v1.7, the output is a little more expressive -- it shows the hyperthreads:

-
% cd 
% cd ompi/mpiext/affinity/c
% mpicc example.c -o example
% mpirun --mca btl tcp,sm,self --report-bindings --host svbu-mpi056 --np 2 
--bind-to-core ./example
[svbu-mpi056:25041] [[23016,0],1] odls:default binding child [[23016,1],0] to 
cpus 0,8
[svbu-mpi056:25041] [[23016,0],1] odls:default binding child [[23016,1],1] to 
cpus 2,10
[svbu-mpi056:25042] [[23016,1],0] is bound to cpus 0,8
[svbu-mpi056:25043] [[23016,1],1] is bound to cpus 2,10
rank 0 (resource string): 
   ompi_bound: socket 1[core 0[hwt 0-1]]
  current_binding: socket 1[core 0[hwt 0-1]]
   exists: socket 1 has 4 cores, each with 2 hwts; socket 0 has 4 
cores, each with 2 hwts
rank 0 (layout): 
   ompi_bound: [BB/../../..][../../../..]
  current_binding: [BB/../../..][../../../..]
   exists: [../../../..][../../../..]
rank 1 (resource string): 
   ompi_bound: socket 1[core 1[hwt 0-1]]
  current_binding: socket 1[core 1[hwt 0-1]]
   exists: socket 1 has 4 cores, each with 2 hwts; socket 0 has 4 
cores, each with 2 hwts
rank 1 (layout): 
   ompi_bound: [../BB/../..][../../../..]
  current_binding: [../BB/../..][../../../..]
   exists: [../../../..][../../../..]
%
-

I notice the --report-bindings output is a bit different in 1.7 vs. 1.6.  We 
should clarify this stuff, make it user-friendly, and make it the same (as much 
as possible) between 1.6.x and 1.7.x.  I'll work on that.
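
As an aside on the older mask-style output quoted further down ("binding child 
... to cpus 0001" / "to cpus 0002"): per Brice's explanation, each set bit in 
the mask is one physical CPU index.  Here is a minimal sketch of decoding such 
a mask, assuming it is a plain hexadecimal string that fits in an unsigned 
long.  mask_to_cpus is a made-up helper name, not an hwloc or Open MPI 
function:

```c
#include <stdlib.h>

/* Hypothetical helper: decode a hexadecimal CPU mask string (e.g. "0002")
 * into the indexes of its set bits.  Stores at most `max` indexes in `out`
 * and returns the total number of set bits found. */
static int mask_to_cpus(const char *hexmask, int *out, int max)
{
    unsigned long mask = strtoul(hexmask, NULL, 16);
    int count = 0;

    for (int bit = 0; mask != 0; bit++, mask >>= 1) {
        if (mask & 1UL) {
            if (count < max)
                out[count] = bit;   /* bit position == physical CPU index */
            count++;
        }
    }
    return count;
}
```

With the masks from the quoted 1.4.4 output, "0001" decodes to CPU 0 and "0002" 
to CPU 1, which is what Brice describes.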


On May 30, 2012, at 10:06 AM, Brice Goglin wrote:

> Jeff,
> What is the displayed bitmask in OMPI 1.6? Is it the hwloc bitmask? Or
> the OMPI bitmask made of OMPI indexes?
> Brice
> 
> 
> 
> Le 30/05/2012 16:01, Jeff Squyres a écrit :
>> You might want to try the OMPI tarball that is about to become OMPI v1.6.1 
>> -- we made a bunch of affinity-related fixes, and it should be much more 
>> predictable / stable in what it does in terms of process binding:
>> 
>>http://www.open-mpi.org/~jsquyres/unofficial/
>> 
>> (these affinity fixes are not yet in a nightly 1.6 tarball because we're 
>> testing them before they get committed to the OMPI v1.6 SVN branch)
>> 
>> 
>> On May 30, 2012, at 9:54 AM, Brice Goglin wrote:
>> 
>>> Hello Youri,
>>> When using openmpi 1.4.4 with "--np 2 --bind-to-core --bycore" it reports 
>>> the following:
>>>> [hostname:03339] [[17125,0],0] odls:default:fork binding child 
>>>> [[17125,1],0] to cpus 0001
>>>> 
>>>> [hostname:03339] [[17125,0],0] odls:default:fork binding child 
>>>> [[17125,1],1] to cpus 0002
>>>> 
>>> Bitmasks 0001 and 0002 mean CPUs with physical indexes 0 and 1 in OMPI 1.4. 
>>> So that corresponds to the first core of each socket, and that matches what 
>>> hwloc-ps says. Running "hwloc-ps -c" should show the same bitmask.
>>> 
>>> However, I agree that these are not adjacent cores, but I don't know enough 
>>> of OMPI binding options to understand what it was supposed to do in your 
>>> case.
>>> 
>>> Brice
>>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Understanding hwloc-ps output

2012-05-30 Thread Jeff Squyres
You might want to try the OMPI tarball that is about to become OMPI v1.6.1 -- 
we made a bunch of affinity-related fixes, and it should be much more 
predictable / stable in what it does in terms of process binding:

http://www.open-mpi.org/~jsquyres/unofficial/

(these affinity fixes are not yet in a nightly 1.6 tarball because we're 
testing them before they get committed to the OMPI v1.6 SVN branch)


On May 30, 2012, at 9:54 AM, Brice Goglin wrote:

> Hello Youri,
> When using openmpi 1.4.4 with "--np 2 --bind-to-core --bycore" it reports the 
> following:
>> [hostname:03339] [[17125,0],0] odls:default:fork binding child [[17125,1],0] 
>> to cpus 0001
>> 
>> [hostname:03339] [[17125,0],0] odls:default:fork binding child [[17125,1],1] 
>> to cpus 0002
>> 
> 
> Bitmasks 0001 and 0002 mean CPUs with physical indexes 0 and 1 in OMPI 1.4. So 
> that corresponds to the first core of each socket, and that matches what 
> hwloc-ps says. Running "hwloc-ps -c" should show the same bitmask.
> 
> However, I agree that these are not adjacent cores, but I don't know enough 
> of OMPI binding options to understand what it was supposed to do in your case.
> 
> Brice
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] #tgfh (thank God for hwloc)

2012-05-18 Thread Jeff Squyres
Yesterday, I was installing 2 machines in a physically remote location -- 
meaning that I did not have access to see or touch the machines.  Although the 
2 machines are slightly different models from each other, they both have 
multiple Ethernet ports: some LOM, and 2 ports on a PCI 10GB Ethernet NIC.

All the Ethernet ports are live and connected to different networks.

I was working on setting up the 2 ports on the PCI card.  #tgfh, because hwloc 
clearly showed me which ports were on a PCI device (by grouping and by vendor 
ID) and told me exactly what their ethX devices were.  And, by extension, it 
showed me which ports were LOM (and what their ethX devices were).  See the 2 
PDFs attached for what hwloc showed me on each machine.

This allowed me to go edit the relevant 
/etc/sysconfig/network-scripts/ifcfg-ethX scripts and be up and running within 
minutes.

Yay hwloc!!

(sorry; I just felt the need to share this story :-) )

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


svbu-mpi058.pdf
Description: Adobe PDF document


svbu-mpi059.pdf
Description: Adobe PDF document


Re: [hwloc-users] PCI devices in the topology

2012-02-10 Thread Jeff Squyres
On Feb 10, 2012, at 3:37 PM, Brice Goglin wrote:

> All objects of the same type are *always* at the same depth (for caches
> and groups, replace "same type" with "same type and same level" so that
> L1 is not at the same depth as L3). That works even if your topology is
> not symmetric at all, because a child can have a depth that is
> different from its parent depth plus one.
> 
> PCI objects are not placed in levels as other regular objects are. They
> are in specific lists. However, to make the API more uniform, we have
> some fake depth values that let us identify and walk in the list of
> bridges, PCI devices or OS devices.
> 
> In the above case, the NUMA node P#0 should be at depth 1, it has two
> children. The first one is Socket P#1 at depth 2. The second one is a
> hostbridge at depth -3 (fake depth for bridges iirc).

Ok.  But in terms of walking down the hwloc tree, PCI devices will show up in 
someone's children[] array, right?  I.e., they're not in a separate list 
somewhere, right?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] PCI devices in the topology

2012-02-10 Thread Jeff Squyres
When PCI devices are put into the tree, do they potentially make other objects 
be at different depths?  

For example, http://www.open-mpi.org/projects/hwloc/devel09-pci.png has a PCI 
bridge hanging off a socket.  Are the cores on sockets P0 and P1 at the same 
depth?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Compiling hwloc into a static library on Windows and Linux

2012-01-09 Thread Jeff Squyres
On Jan 9, 2012, at 5:25 PM, Samuel Thibault wrote:

>> However, when I specify the --enable-embedded-mode flag in configure in 
>> Linux,
>> no libraries are built at all - the specified prefix directory contains only
>> empty directories.
> 
> But the library is built, it's just not installed because projects often
> prefer to link the library in, or something similar. If you want to
> install libhwloc.a, simply fetch it from src/.libs/

To be clear: I think you're misunderstanding what --enable-embedded-mode is 
for.  Per Samuel's comment, I think you want --enable-static (and possibly 
--disable-shared).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 1:04 PM, Brice Goglin wrote:

> "XML output" should be "XML input/output" or "XML support".

Done:

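-----------------------------------------------------------------------------
Hwloc optional build support status (more details can be found above):

Probe / display PCI devices: yes
Graphical output (Cairo):    yes
XML input / output:          full
Memory support:              binding, set policy, migrate pages
-----------------------------------------------------------------------------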

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 12:01 PM, Brice Goglin wrote:

> Yes, always installed. There are some configure checks for verbs, but
> it's only used for enabling verbs-related helper testing.

Ok, how's this for output at the end of configure? 

Linux:

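-----------------------------------------------------------------------------
Hwloc optional build support status (more details can be found above):

Probe / display PCI devices: yes
Graphical output (Cairo):    yes
XML output:                  full
Memory support:              binding, set policy, migrate pages
-----------------------------------------------------------------------------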

OS X:

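-----------------------------------------------------------------------------
Hwloc optional build support status (more details can be found above):

Probe / display PCI devices: no
Graphical output (Cairo):    yes
XML output:                  full
Memory support:              none
-----------------------------------------------------------------------------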

XML support will show "basic" if libxml2 is not found.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 11:53 AM, Brice Goglin wrote:

>> What about MX, verbs, Cuda, ...?
> 
> MX and verbs are not used internally, we just have public helpers to
> interoperate with them (and tests).

I forget -- are the helpers installed/available even if the MX 
headers/libraries are not found at configure time?  (ditto for verbs, cuda, 
etc.)

> Same for cuda in trunk (until Samuel's cuda branch gets merged).
> 
> Brice
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 10:33 AM, Brice Goglin wrote:

>> - Kerrighed
>> - PCI device support
>> - XML support
> 
> I would put XML, PCI, Cairo and libnuma

What about MX, verbs, Cuda, ...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 10:16 AM, Stefan Eilemann wrote:

>> FWIW, I've traditionally been against such things for two reasons:
> 
> Your call, really. The information is there and not too hard to find, but I 
> missed it on the first run. Most software I know provides this in a very 
> concise list at the end (Supported: A B C\n Unsupported: D E F).

Let me throw this back to Brice / Samuel...

If we had such a thing at the bottom of configure, what items should we show?  
I can think of the following obvious ones offhand:

- Kerrighed
- PCI device support
- XML support

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 7:25 AM, Stefan Eilemann wrote:

>> You are probably missing the libpci-devel package.
> 
> Thanks, that either doesn't exist or wasn't installed on Redhat. It works now.
> 
> I think messages of found/not found optional modules could be more prominent 
> at the end of the configure process.

FWIW, I've traditionally been against such things for two reasons:

1. The information *was* displayed above (i.e., that pci-devel wasn't 
found/wasn't usable/whatever).  I realize that most people don't read the 
stdout of configure at all, but all the information you need is already there.

2. A list of what will/will not be built at the end tends to grow lengthy such 
that it dilutes the value of repeating the information at the end.

That being said, I can *somewhat* see the value of displaying a user-friendly 
"PCI device support will not be built" vs. the output of a configure test, 
which might be somewhat obscure.  However, in hwloc's case, the configure test 
output is pretty self-evident.  Examples:

checking for PCI... no
checking pci/pci.h usability... no
checking pci/pci.h presence... no
checking for pci/pci.h... no
checking for LIBXML2... yes
checking for xmlNewDoc... yes
checking for final LIBXML2 support... yes

A simple string search for "pci" and "xml" will find these lines in the 
configure output.  Assumedly, if you're building from source, you've likely got 
at least *some* experience and it shouldn't be unreasonable to ask you to go 
look in the output of configure.
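
The string search mentioned above could look roughly like the sketch below. It is only an illustration: the function name filter_optional_checks is made up here, and the sample input reuses a few of the configure lines quoted above plus one unrelated line added for contrast.

```sh
#!/bin/sh
# Print only the configure-output lines that mention PCI or XML checks.
# Reads configure output on stdin; -i makes the match case-insensitive.
# (filter_optional_checks is a hypothetical helper name, not part of hwloc.)
filter_optional_checks() {
    grep -i -E 'pci|xml'
}

# Sample input: lines from the configure output quoted above, plus one
# unrelated line (added for illustration) that the filter should drop.
sample_output='checking for gcc... gcc
checking for PCI... no
checking pci/pci.h usability... no
checking for LIBXML2... yes
checking for final LIBXML2 support... yes'

printf '%s\n' "$sample_output" | filter_optional_checks
```

On a real build, the same filter can be applied to saved output, e.g. by running "./configure 2>&1 | tee config.out" and then searching config.out.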

Don't get me wrong -- I'm not dead-set against a listing at the bottom.  I just 
find it redundant and somewhat of a maintenance hassle.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Re : lstopo on multiple machines

2011-08-17 Thread Jeff Squyres
On Aug 17, 2011, at 2:26 AM, Brice Goglin wrote:

>> What about an MPI version of lstopo ?
> 
> If you want to see the entire MPI job topology within a single topology,
> doing it inside hwloc would likely require to check for mpirun/mpiexec
> parameters and so on at configure... big mess. Something like below with
> the previously proposed API/utility may be enough:
>mpirun lstopo .xml
>hwloc_xml_agregate cluster.xml *.xml
>export HWLOC_XMLFILE=cluster.xml

Much as an MPI version sounds interesting (even potentially as a 3rd-party 
tool), I have to agree that a shell script similar to what Brice typed might be 
much easier.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Re : lstopo on multiple machines

2011-08-16 Thread Jeff Squyres
I'd be against hwloc automatically spreading across multiple machines.  I think 
there are plenty of tools to do that already.

That being said, having better support for being able to aggregate data from 
multiple hwloc instances (e.g., lstopo) on multiple machines into a single, 
cohesive map, would be great (waving hands here; I have no specific 
suggestions).


On Aug 16, 2011, at 10:02 AM, Brice Goglin wrote:

> 
> Hello Seb,
> Hwloc only looks at the local machine, there's no support for multinode 
> topology detection so far. We are considering adding it but we don't know yet 
> what users want to do with it, if it should be in the core or not, automatic 
> or not. Your feedback is welcome.
> Brice
> 
> - Reply message -
> From: "PULVERAIL Sébastien" 
> To: 
> Subject: [hwloc-users] lstopo on multiple machines
> Date: Tue., Aug 16, 2011 15:04
> 
> 
> 
> 
> Hello,
> 
> 
> 
> I have two machines I use for running my programs on multiple nodes (with
> hydra or slurm).
> 
> When I launch my lstopo command, only one machine's characteristics are
> printed.
> 
> How can I tell HWLOC to look for those two machines ?
> 
> 
> 
> --
> 
> Seb
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] Article about hwloc published in Linux Pro Magazine

2011-07-14 Thread Jeff Squyres
Woo hoo!  Brice, Samuel, and I wrote an article about hwloc for Linux Pro 
Magazine.  My copy just showed up in the mail:


http://blogs.cisco.com/performance/hwloc-article-published-in-linux-pro-magazine/

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Patch to disable GCC __builtin_ operations

2011-06-08 Thread Jeff Squyres
On Jun 8, 2011, at 5:30 PM, Dave Goodell wrote:

>> Is there a reason we wouldn't disable it in OMPI's hwloc by default?
> 
> Performance will be better when left enabled on platforms where the compiler 
> and the architecture are in agreement...

I'm not too concerned about hwloc's performance in OMPI -- it'll be used during 
initialization only.  Unless there's a dramatic difference for, say, 
large-core-count machines, I'd be inclined to just disable it unless there's 
some reason to leave it on.  It's one less thing that a user will have to 
know/remember to --disable, even in Josh's exotic case.

> IMO Josh's use case is a bit exotic.  He's using one system's compiler as an 
> approximation of an appropriate compiler for another system instead of using 
> a cross compiler or compiling in an identical environment.  That viewpoint 
> may or may not be shared by the OMPI developers.
> 
> -Dave
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Patch to disable GCC __builtin_ operations

2011-06-08 Thread Jeff Squyres
Is there a reason we wouldn't disable it in OMPI's hwloc by default?

On Jun 8, 2011, at 5:14 PM, Josh Hursey wrote:

> In short, I haven't yet. I figured out the problem was in hwloc, and
> started with the hwloc branch by itself.
> 
> In Open MPI, we should be able to pass the --disable-gcc-builtin from
> the main configure, right (since we pull in config/hwloc_internal.m4)?
> So we would pass it similar to how we had to pass --disable-xml to
> turn off that feature in the builtin hwloc (before it was turned off
> by default).
> 
> -- Josh
> 
> On Wed, Jun 8, 2011 at 4:50 PM, Jeff Squyres  wrote:
>> Josh --
>> 
>> How did you get this disabled from within OMPI?  We don't invoke hwloc's 
>> configure via sub-shell; we directly invoke its m4, so we don't have an 
>> opportunity to pass --disable-gcc-builtin.  Unless you passed that to the 
>> top-level OMPI configure script...?
>> 
>> 
>> On Jun 8, 2011, at 4:28 PM, Josh Hursey wrote:
>> 
>>> (This should have gone to the devel list)
>>> 
>>> The attached patch adds a configure option (--disable-gcc-builtin) to
>>> disable the use of GCC __builtin_ operations, even if the GCC compiler
>>> supports them. The patch is a diff from the r3509 revision of the
>>> hwloc trunk.
>>> 
>>> I hit a problem when installing hwloc statically on a machine with
>>> slightly different gcc support libraries and OSs on the head/compile
>>> node versus the compute nodes. The builtin functions would cause hwloc
>>> to segfault when run on the compute nodes. Disabling the builtin
>>> operations and using the more portable techniques seemed to do the
>>> trick.
>>> 
>>> This problem first became apparent when using hwloc as part of Open
>>> MPI. In Open MPI the mpirun process runs on the headnode, so the hwloc
>>> install would work in the mpirun process but cause the compute
>>> processes to segv.
>>> 
>>> Can you review the patch, and apply it to the trunk? Once the patch is
>>> in the trunk, then I'll work on the Open MPI folks to update their
>>> revision.
>>> 
>>> Thanks,
>>> Josh
>>> 
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Patch to disable GCC __builtin_ operations

2011-06-08 Thread Jeff Squyres
Josh --

How did you get this disabled from within OMPI?  We don't invoke hwloc's 
configure via sub-shell; we directly invoke its m4, so we don't have an 
opportunity to pass --disable-gcc-builtin.  Unless you passed that to the 
top-level OMPI configure script...?


On Jun 8, 2011, at 4:28 PM, Josh Hursey wrote:

> (This should have gone to the devel list)
> 
> The attached patch adds a configure option (--disable-gcc-builtin) to
> disable the use of GCC __builtin_ operations, even if the GCC compiler
> supports them. The patch is a diff from the r3509 revision of the
> hwloc trunk.
> 
> I hit a problem when installing hwloc statically on a machine with
> slightly different gcc support libraries and OSs on the head/compile
> node versus the compute nodes. The builtin functions would cause hwloc
> to segfault when run on the compute nodes. Disabling the builtin
> operations and using the more portable techniques seemed to do the
> trick.
> 
> This problem first became apparent when using hwloc as part of Open
> MPI. In Open MPI the mpirun process runs on the headnode, so the hwloc
> install would work in the mpirun process but cause the compute
> processes to segv.
> 
> Can you review the patch, and apply it to the trunk? Once the patch is
> in the trunk, then I'll work on the Open MPI folks to update their
> revision.
> 
> Thanks,
> Josh
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] HWLOC problem

2011-06-07 Thread Jeff Squyres
(brought over from the OMPI user's list)

This likely means you installed hwloc to a non-standard location (meaning that 
your system is not looking for shared libraries in $hwloc_prefix/lib by 
default).  

If you prepend or append $hwloc_prefix/lib to your LD_LIBRARY_PATH environment 
variable (or set the variable, if it's not already set), the loader should find 
hwloc's shared library, and lstopo -- and all of its friends -- should work fine.
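
As a rough sketch of that suggestion for a Bourne-style shell: the helper name prepend_ld_library_path and the /opt/hwloc default are placeholders invented for this example. Substitute whatever --prefix was actually given to hwloc's configure.

```sh
#!/bin/sh
# Prepend a directory to LD_LIBRARY_PATH, handling the unset/empty case
# so that no stray leading or trailing ':' is produced.
# (prepend_ld_library_path is a hypothetical helper, not an hwloc tool.)
prepend_ld_library_path() {
    dir=$1
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="$dir:$LD_LIBRARY_PATH"
    else
        LD_LIBRARY_PATH="$dir"
    fi
    export LD_LIBRARY_PATH
}

# hwloc_prefix is whatever was passed to configure's --prefix;
# /opt/hwloc is only a placeholder for illustration.
hwloc_prefix=${hwloc_prefix:-/opt/hwloc}
prepend_ld_library_path "$hwloc_prefix/lib"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

After this has been run in the current shell (or added to a shell startup file), running lstopo from the same shell should be able to locate libhwloc.so.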


On Jun 7, 2011, at 12:51 PM, vaibhav dutt wrote:

> Hi,
> 
> I have installed HWLOC 1.2 on my cluster , each node has two Intel Xeon E5450 
> quad cores.
> When I try to execute the command "lstopo" to determine the hardware topology 
> of my system,
> I get an error like:
> 
> ./lstopo: error while loading shared libraries: libhwloc.so.3: cannot open 
> shared object file: No such file or directory
> 
> 
> Can anyone please help me as to what is the reason for this error and where 
> can I find this shared
> library.
> 
> Thanks.
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] Fwd: [OMPI devel] problem with absent L3 on AMD CPU

2011-04-10 Thread Jeff Squyres
Moving this patch over to the hwloc users list...

Begin forwarded message:

> From: Andriy Gapon 
> Date: April 10, 2011 3:47:37 AM EDT
> To: de...@open-mpi.org
> Subject: [OMPI devel] problem with absent L3 on AMD CPU
> Reply-To: Open MPI Developers 
> 
> 
> It seems that lstopo can get mightily confused with an AMD Athlon II processor
> (family 10h) that doesn't have an L3 cache.
> 
> I believe that the following patch should fix that:
> --- src/topology-x86.c.orig   2011-04-10 10:38:39.370239628 +0300
> +++ src/topology-x86.c        2011-04-10 10:38:44.573256245 +0300
> @@ -59,10 +59,6 @@
>    unsigned cachenum;
>    unsigned size = 0;
>  
> -  cachenum = infos->numcaches++;
> -  infos->cache = realloc(infos->cache, infos->numcaches*sizeof(*infos->cache));
> -  cache = &infos->cache[cachenum];
> -
>    if (level == 1)
>      size = ((cpuid >> 24)) << 10;
>    else if (level == 2)
> @@ -72,6 +68,10 @@
>    if (!size)
>      return;
>  
> +  cachenum = infos->numcaches++;
> +  infos->cache = realloc(infos->cache, infos->numcaches*sizeof(*infos->cache));
> +  cache = &infos->cache[cachenum];
> +
>    cache->type = 1;
>    cache->level = level;
>    if (level <= 2)
> 
> 
> Otherwise, numcaches gets incremented and the cache array grows a new entry,
> but that new entry is not initialized.  Maybe this is an OS- or
> environment-specific problem, but at least here on FreeBSD the new memory is
> not zeroed out, and POSIX doesn't require realloc to do that.
> 
> This report is for the version 1.1.2.
> Apologies for the noise if this problem is already fixed in newer code.
> 
> Thanks!
> -- 
> Andriy Gapon
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 14, 2011, at 8:15 PM, Siew Yin Chan wrote:

> Thank you very much for your input which makes my direction pretty clear now. 
> Depending on the progress of my project, I may be adventurous to try the 
> nightly tarball, or may wait until a stable version is released.

FWIW, we released 1.5.2rc1 today.  It contains the hwloc stuff.

> I appreciate the hard work of the OMPI team, and am looking forward to a more 
> flexible binding option in OMPI's future release.

Thanks!  We're shooting for 1.5.3, but it might slip to 1.5.4.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:

> 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 
> does provide the --bycore and --bind-to-core option, but this option seems to 
> bind processes to cores on my machine according to the *physical* indexes:

FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we 
switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a 
different mechanism).

> FYI, my testing environment and application imposes these requirements for 
> optimum performance:
> 
> i. Different binaries optimized for heterogeneous machines. This necessitates 
>  MIMD, and can be done in OMPI using the -app option (providing an 
> application context file).
> ii. The application is communication-sensitive. Thus, fine-grained process 
> mapping on *machines* and on *cores* is required to minimize inter-machine 
> and inter-socket communication costs occurring on the network and on the 
> system bus. Specifically, processes should be mapped onto successive cores of 
> one socket before the next socket is considered, i.e., socket.0:core0-3, then 
> socket.1:core0-3. In this case, the communication among neighboring rank 0-3 
> will be confined to socket 0 without going through the system bus. Same for 
> rank 4-7 on socket 1. As such, the order of the cores should follow the 
> *logical* indexes.

I think that OMPI 1.5.2 should do this for you -- rather than following any 
logical/physical ordering, it does what you describe: traverses successive 
cores on a socket before going to the next socket (which happens to correspond 
to hwloc's logical ordering, but that was not the intent).

FWIW, we have a huge revamp of OMPI's affinity support on the mpirun command 
line that will offer much more flexible binding choices.

> Initially, I tried combining the features of rankfile and appfile, e.g.,
> 
> $ cat rankfile8np4
> rank 0=compute-0-8 slot=0:0
> rank 1=compute-0-8 slot=0:1
> rank 2=compute-0-8 slot=0:2
> rank 3=compute-0-8 slot=0:3
> $ cat rankfile9np4
> rank 0=compute-0-9 slot=0:0
> rank 1=compute-0-9 slot=0:1
> rank 2=compute-0-9 slot=0:2
> rank 3=compute-0-9 slot=0:3
> $ cat my_appfile_rankfile
> --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> $ mpirun -app my_appfile_rankfile
> 
> but found out that only the rankfile stated on the first line took effect; 
> the second was ignored completely. After some time of googling and trial and 
> error, I decided to try an external binder, and this direction led me to 
> hwloc-bind.
> 
> Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

Yes.  

I'd have to look at it more closely, but it's possible that we only allow one 
rankfile per job -- i.e., that the rankfile should specify all the procs in the 
job, not on a per-host basis.  But perhaps we don't warn/error if multiple 
rankfiles are used; I would consider that a bug.
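
If it does turn out that only one rankfile per job is honored, a single rankfile covering all 8 ranks could be generated along these lines. This is a sketch only: the helper name rankfile_lines is invented here, the host names and the "rank N=host slot=S:C" syntax are copied from the rankfiles quoted above, and whether such a file combines cleanly with an appfile has not been confirmed in this thread.

```sh
#!/bin/sh
# Emit rankfile lines for `count` ranks on `host`, numbering the ranks
# globally from `first` and placing one rank per core on socket 0.
# (rankfile_lines is a hypothetical helper, not an Open MPI tool.)
rankfile_lines() {
    host=$1
    first=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "rank $((first + i))=$host slot=0:$i"
        i=$((i + 1))
    done
}

# Ranks 0-3 on compute-0-8 and ranks 4-7 on compute-0-9, in one file's worth
# of output (redirect to a file to use it).
rankfile_lines compute-0-8 0 4
rankfile_lines compute-0-9 4 4
```

The output can be redirected to a file and passed to mpirun with -rf, in the same way the per-host rankfiles were passed above.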

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 13, 2011, at 4:07 AM, Brice Goglin wrote:

>> $ mpirun -np 4 hwloc-bind socket:0.core:0-3 ./test
>> 
>> 1. Does hwloc-bind map the processes *sequentially* on *successive* cores of 
>> the socket?
> 
> No. Each hwloc-bind command in the mpirun above doesn't know that there are 
> other hwloc-bind instances on the same machine. All of them bind their 
> process to all cores in the first socket.

To further underscore this point, mpirun launched 4 copies of:

hwloc-bind socket:0.core:0-3 ./test

Which means that all 4 processes bound to exactly the same thing.

If you want each process to bind to a *different* set of PU's, then you have 
two choices:

1. See Open MPI 1.5.1's mpirun(1) man page.  There are new affinity options in 
the OMPI 1.5 series, such as --bind-to-core and --bind-to-socket.  We wrote 
them up in the FAQ, too.

2. Write a wrapper script that looks at the Open MPI environment variables 
OMPI_COMM_WORLD_RANK, or OMPI_COMM_WORLD_LOCAL_RANK, or 
OMPI_COMM_WORLD_NODE_RANK and decides how to invoke hwloc-bind.  For example, 
something like this:

mpirun -np 4 my_wrapper.sh ./test

where my_wrapper.sh is:

-
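#!/bin/sh

# Pick a binding based on this process's rank in MPI_COMM_WORLD.
if test "$OMPI_COMM_WORLD_RANK" = "0"; then
    bind_string=...whatever...
else
    bind_string=...whatever...
fi
# "$@" (rather than $*) preserves arguments that contain spaces.
hwloc-bind $bind_string "$@"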
-

Something like that.
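
To make the wrapper idea a little more concrete, here is one possible way to fill in the bind string. It is a sketch under stated assumptions, not a tested recipe: it assumes 4 cores per socket (as on the machine discussed in this thread), uses OMPI_COMM_WORLD_LOCAL_RANK so that each process on a node gets its own core, and reuses the socket:N.core:M location syntax from the hwloc-bind command quoted above. The function name bind_string_for_rank and the CORES_PER_SOCKET variable are made up for this example.

```sh
#!/bin/sh
# my_wrapper.sh (sketch): bind each node-local MPI rank to its own core.

# Assumption: 4 cores per socket; override via the environment if needed.
CORES_PER_SOCKET=${CORES_PER_SOCKET:-4}

# Map a node-local rank to an hwloc-bind location string.
# Ranks fill all cores of socket 0 before moving on to socket 1, and so on.
# (bind_string_for_rank is a hypothetical helper name.)
bind_string_for_rank() {
    rank=$1
    socket=$((rank / CORES_PER_SOCKET))
    core=$((rank % CORES_PER_SOCKET))
    echo "socket:$socket.core:$core"
}

# Only launch when a command was given on the command line.
if [ "$#" -gt 0 ]; then
    bind_string=$(bind_string_for_rank "${OMPI_COMM_WORLD_LOCAL_RANK:-0}")
    exec hwloc-bind "$bind_string" "$@"
fi
```

It would be invoked the same way as before, i.e. "mpirun -np 4 my_wrapper.sh ./test". With the assumed 4 cores per socket, local ranks 0-3 land on socket 0 and ranks 4-7 on socket 1.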

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc@SC10

2010-11-12 Thread Jeff Squyres
Brice will also be giving a ~10 min short talk on hwloc in the Cisco booth; 
stop by and say hello!  You can hear the "right" way to pronounce "hwloc".  :-)

Cisco is also hosting some Open MPI/MPI-related short talks in our booth; I 
just posted about this on the Open MPI lists:

http://www.open-mpi.org/community/lists/users/2010/11/14741.php

Drop by the Cisco booth for the exact schedule; we're right next to the main 
SciNet NOC.

See you there!



On Nov 8, 2010, at 11:22 AM, Brice Goglin wrote:

> Hello,
> For those of you going to SC10 @ New Orleans next week, you should know
> that hwloc will be there too. I will be hanging around the INRIA Booth
> (#2751, between TACC and Microsoft) and Jeff Squyres will be near the
> Cisco Booth (#3247, on the other side of Microsoft). Feel free to visit
> us and request new features for hwloc 1.2 :)
> Brice
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] quick question

2010-07-22 Thread Jeff Squyres
On Jul 22, 2010, at 9:20 AM, Rupert Brooks wrote:

> First, I have to report a little bug with the web site: searching
> the users list archives seems not to work. When I tried to search, I
> got the following error message:
> 
> Search Hardware Locality [Users] List Archive Index file error: Could
> not open the index file './index.swish-e': No such file or directory

Ouch!  I'll get that fixed...  Thanks for letting us know.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] hwloc_set/get_thread_cpubind

2010-07-15 Thread Jeff Squyres
Fixed -- thanks for the heads-up!

On Jul 14, 2010, at 2:28 PM, Αλέξανδρος Παπαδογιαννάκης wrote:

> 
> hwloc_set_thread_cpubind and hwloc_get_thread_cpubind are missing from the 
> html documentation
> http://www.open-mpi.org/projects/hwloc/doc/v1.0.1/group__hwlocality__binding.php
>  
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc sockets support on solaris

2010-06-23 Thread Jeff Squyres
On Jun 23, 2010, at 4:29 PM, Brice Goglin wrote:

> Don't you want hwloc_cpuset_set(set, i) instead ?
> hwloc_cpuset_cpu(set, i) changes the cpuset into a single CPU, i.e. it's
> zero(set) + set(set, i).

Ah.  Well, that would do it.  :-)
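
The difference Brice describes can be seen in a toy model where a plain integer stands in for the cpuset. To be clear, this does not call hwloc at all; mask_set and mask_only are made-up helpers that only imitate the semantics described above (add one bit versus replace the whole set with one bit).

```sh
#!/bin/sh
# Toy model only: a plain integer stands in for an hwloc cpuset.
# These helpers are NOT hwloc functions; they imitate the described semantics.

# Like hwloc_cpuset_set(set, i): add bit i, keep the others.
mask_set() { echo $(( $1 | (1 << $2) )); }

# Like hwloc_cpuset_cpu(set, i): ignore the old mask, keep only bit i.
mask_only() { echo $(( 1 << $2 )); }

# Build a mask for CPUs 0..3 both ways.
added=0
replaced=0
for i in 0 1 2 3; do
    added=$(mask_set "$added" "$i")
    replaced=$(mask_only "$replaced" "$i")
done

echo "accumulating: $added"     # 15, i.e. bits 0-3 all set
echo "replacing:    $replaced"  # 8, i.e. only bit 3 survives
```

In the "replacing" case only the last CPU of the loop remains in the mask, which would explain a socket-wide binding request ending up bound to a single CPU.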

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] hwloc sockets support on solaris

2010-06-23 Thread Jeff Squyres
Hm.  We should be.  Here's the hwloc plugin code for setting CPU affinity (it's 
static because it's invoked by function pointer):

static int module_set(opal_paffinity_base_cpu_set_t mask)
{
    int i, ret = OPAL_SUCCESS;
    hwloc_cpuset_t set;
    hwloc_topology_t *t = &mca_paffinity_hwloc_component.topology;

    set = hwloc_cpuset_alloc();
    hwloc_cpuset_zero(set);
    for (i = 0; ((unsigned int) i) < OPAL_PAFFINITY_BITMASK_T_NUM_BITS; ++i) {
        if (OPAL_PAFFINITY_CPU_ISSET(i, mask) &&
            i < mca_paffinity_hwloc_component.cpuset_max_size) {
            hwloc_cpuset_cpu(set, i);
        }
    }

    if (0 != hwloc_set_cpubind(*t, set, 0)) {
        ret = OPAL_ERR_IN_ERRNO;
    }
    hwloc_cpuset_free(set);

    return ret;
}


On Jun 23, 2010, at 4:14 PM, Brice Goglin wrote:

> I see this in the solaris binding core:
> 
>   if (hwloc_cpuset_weight(hwloc_set) != 1) {
>     errno = EXDEV;
>     return -1;
>   }
> 
> OMPI doesn't get this error ?
> 
> Brice
> 
> 
> 
> 
> Le 23/06/2010 21:56, Terry Dontje a écrit :
>> Does hwloc think it supports binding processes to sockets or multiple cpus?  
>> I am asking because I believe there are no current Solaris accessors that 
>> support this (processor_bind only binds a pid or a set of pids to a *single* 
>> processor).  
>> 
>> I bring this up because in testing OMPI with hwloc support it looks like 
>> -bind-to-socket is acting like -bind-to-core on Solaris.  I believe the 
>> issue is hwloc should be returning an error to tell OMPI it cannot 
>> bind-to-socket or multiple cpus at that.
>> 
>> -- 
>> 
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.650.633.7054
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.don...@oracle.com
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] [hwloc-devel] hwloc and rpath

2010-06-22 Thread Jeff Squyres
On Jun 22, 2010, at 8:48 AM, Jirka Hladky wrote:

> So basically, until the libtool patch makes it through upstream into other
> distributions, the configure script will need to be patched.

Bummer.  But at least it's a good explanation and a workaround!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] [hwloc-devel] hwloc and rpath

2010-06-21 Thread Jeff Squyres
On Jun 21, 2010, at 4:30 PM, Jirka Hladky wrote:

> I'm not sure what's wrong. It seems like libtool is not smart enough to
> recognize /usr/lib64 as the default library directory on a 64-bit system. I have
> asked on the Fedora packaging mailing list for advice. I will keep you updated.

Awesome; thanks.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] [hwloc-devel] hwloc and rpath

2010-06-21 Thread Jeff Squyres
On Jun 21, 2010, at 3:18 PM, Jirka Hladky wrote:
> $chrpath --list /usr/local/bin/lstopo
> /usr/local/bin/lstopo: RPATH=/usr/local/lib

Ah, I understand now.  And I'm seeing the same behavior:

$ cd util
$ rm lstopo
$ make V=1
/bin/sh ../libtool  --tag=CC   --mode=link gcc -I/usr/include/cairo 
-I/usr/include/freetype2 -I/usr/include/libpng12   -I/usr/include/libxml2   
-std=gnu99   -fvisibility=hidden  -I/home/jsquyres/hwloc-1.0.1/include 
-L/home/jsquyres/hwloc-1.0.1/src  -o lstopo lstopo-lstopo.o 
lstopo-lstopo-color.o lstopo-lstopo-text.o lstopo-lstopo-draw.o 
lstopo-lstopo-fig.o lstopo-lstopo-cairo.o lstopo-lstopo-xml.o  -lcairo   -lxml2 
-lz -lm   -lm -lncursesw  -lX11 /home/jsquyres/hwloc-1.0.1/src/libhwloc.la
libtool: link: gcc -I/usr/include/cairo -I/usr/include/freetype2 
-I/usr/include/libpng12 -I/usr/include/libxml2 -std=gnu99 -fvisibility=hidden 
-I/home/jsquyres/hwloc-1.0.1/include -o .libs/lstopo lstopo-lstopo.o 
lstopo-lstopo-color.o lstopo-lstopo-text.o lstopo-lstopo-draw.o 
lstopo-lstopo-fig.o lstopo-lstopo-cairo.o lstopo-lstopo-xml.o  
-L/home/jsquyres/hwloc-1.0.1/src -lcairo -lncursesw -lX11 
/home/jsquyres/hwloc-1.0.1/src/.libs/libhwloc.so -lxml2 -lz -lm -Wl,-rpath 
-Wl,/tmp/bogus/lib

...and there's the -rpath in there (my prefix was /tmp/bogus, so it's 
definitely pulling it from /home/jsquyres/hwloc-1.0.1/src/libhwloc.la).  
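
To double-check where that path comes from, one can read the libdir recorded in the .la file, since that is the value libtool turns into -rpath at link time.  A minimal sketch; the helper name la_libdir is made up for illustration, and it assumes the usual single-quoted libdir='...' line that libtool writes into .la files:

```shell
# Hypothetical helper: print the installation directory recorded in a
# libtool .la file (the line of the form libdir='/some/path').
la_libdir() {
    sed -n "s/^libdir='\(.*\)'\$/\1/p" "$1"
}
```

With the build above, `la_libdir /home/jsquyres/hwloc-1.0.1/src/libhwloc.la` would be expected to print the lib directory under the configured prefix.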

I tried building SVN with the latest latest latest GNU tools:

AM 1.11.1
AC 2.65
LT 2.2.8
M4 1.4.14

And the same thing happened.  So this is what LT wants to do.  :-\

We cannot be the only project that builds both LT libraries and then 
executables from those libraries.  What do those projects do?

> Please give me a pointer to 1.0.2 version. I will give it a try, perhaps it
> has been already fixed.

My (double) bad -- the current released version is 1.0.1.  We haven't changed 
anything other than Samuel's s/LIBS/LDADD/ change.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] [hwloc-devel] hwloc and rpath

2010-06-21 Thread Jeff Squyres
On Jun 21, 2010, at 12:54 PM, Jirka Hladky wrote:

> However, libtool does not look into /usr/lib64 by default. I have found 2 ways
> to fix it:

Are we installing to /usr/lib64 by default?  Or do you have something in your 
specfile or your system's default that is resetting libdir to /usr/lib64?

FWIW, in the 1.0.2 distribution tarball, I'm not seeing the -rpath argument in 
the final libtool link.  I see it in the almost-final link:

/bin/sh ../libtool  --tag=CC   --mode=link gcc -std=gnu99   -fvisibility=hidden 
-I/usr/include/libxml2   -std=gnu99   -fvisibility=hidden  
-I/home/jsquyres/hwloc-1.0.1/include -no-undefined  -version-number 0:1:0 
-lxml2 -lz -lm -o libhwloc.la -rpath /usr/local/lib topology.lo traversal.lo 
topology-synthetic.lo bind.lo cpuset.lo misc.lo topology-xml.lo  
topology-linux.lo   topology-x86.lo  

But then libtool seems to strip it out:

libtool: link: gcc -shared  .libs/topology.o .libs/traversal.o 
.libs/topology-synthetic.o .libs/bind.o .libs/cpuset.o .libs/misc.o 
.libs/topology-xml.o .libs/topology-linux.o .libs/topology-x86.o   -lxml2 -lz 
-lm -Wl,-soname -Wl,libhwloc.so.0 -o .libs/libhwloc.so.0.1.0

Are you seeing something different?

How does one check to see if rpath was applied to the final 
.libs/libhwloc.so.0.1.0?  I tried objdump and didn't see anything, but I might 
be looking in the wrong place:

$ objdump .libs/libhwloc.so.0.1.0 -x | grep -i path
de27 l F .text  019d  
hwloc_strdup_mntpath
$
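
One way to check is to look only at the dynamic section, where an embedded run-time path appears as an RPATH or RUNPATH entry; the grep above only matched a symbol name, which suggests there is none.  A minimal sketch, assuming binutils' readelf is installed; the helper names filter_rpath and show_rpath are made up for illustration:

```shell
# Hypothetical helper: reduce `readelf -d` output (read from stdin) to its
# RPATH/RUNPATH entries.  Exits non-zero when there are none.
filter_rpath() {
    grep -E '\((RPATH|RUNPATH)\)'
}

# Print the run-time library search path entries of an ELF file, if any.
show_rpath() {
    readelf -d "$1" | filter_rpath
}
```

For example, `show_rpath .libs/libhwloc.so.0.1.0 || echo "no rpath"` reports whether the library carries an embedded path; `chrpath --list` gives the same information where it is available.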

> 1) Add /usr/lib64 to /etc/ld.so.conf. It works like a charm. The problem is
> that I cannot use this change in the build environment (on a cluster of build
> servers for compilation on different architectures).
> 
> Samuel, do you have the "/usr/lib64" directory listed in /etc/ld.so.conf on
> your 64-bit Debian? If so, I will consider opening a Bugzilla to add the
> "/usr/lib64" directory to /etc/ld.so.conf on Fedora as well.

FWIW, it's not in my RHEL 5.4:

[11:32] svbu-mpi:~/svn/ompi/ompi/mpi/c % more /etc/ld.so.conf
include ld.so.conf.d/*.conf
[11:32] svbu-mpi:~/svn/ompi/ompi/mpi/c % more /etc/ld.so.conf.d/*
/usr/lib64/qt-3.3/lib
[11:32] svbu-mpi:~/svn/ompi/ompi/mpi/c % 
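
The same check can be scripted for a build host.  A rough sketch only: the helper name dir_in_ldconf is made up, it follows a single level of "include" directives, and it assumes relative include patterns are resolved from the directory that holds the config file:

```shell
# Hypothetical helper: succeed if directory $1 appears on a line of its own in
# the ld.so configuration file $2 (default: /etc/ld.so.conf) or in a file that
# it pulls in with a one-level "include" directive.
dir_in_ldconf() (
    dir=$1
    conf=${2:-/etc/ld.so.conf}
    # Assumption: relative include patterns are relative to the config file.
    cd "$(dirname "$conf")" || exit 2
    includes=$(sed -n 's/^include[[:space:]]*//p' "$(basename "$conf")")
    # $includes is deliberately unquoted so the shell expands its globs.
    grep -qxF -- "$dir" "$(basename "$conf")" $includes 2>/dev/null
)
```

On the RHEL 5.4 machine shown above, `dir_in_ldconf /usr/lib64` would fail, while `dir_in_ldconf /usr/lib64/qt-3.3/lib` would succeed.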

> 2) Second approach is to add
> sed -i 's|^hardcode_libdir_flag_spec=.*|hardcode_libdir_flag_spec=""|g' libtool
> sed -i 's|^runpath_var=LD_RUN_PATH|runpath_var=DIE_RPATH_DIE|g' libtool
> into the %configure stage in rpm specs.
> 
> I don't like this approach but it seems to be the only way how to get rid of
> rpath on an automatic build system.

This is definitely not a preferred solution; I don't want to get in the 
business of frobbing a generated libtool script (we do it in Open MPI to work 
around esoteric bugs and it's awful awful awful).
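
For anyone who does end up needing the quoted workaround, its effect is easy to see on a stand-in file.  A sketch, assuming GNU sed (for -i); the function name strip_libtool_rpath is made up, and the two sample lines are illustrative stand-ins rather than text copied from a real generated libtool script:

```shell
# Hypothetical helper: apply the two substitutions from the quoted workaround
# to the file given as $1 (normally the generated ./libtool script).
strip_libtool_rpath() {
    sed -i 's|^hardcode_libdir_flag_spec=.*|hardcode_libdir_flag_spec=""|g' "$1"
    sed -i 's|^runpath_var=LD_RUN_PATH|runpath_var=DIE_RPATH_DIE|g' "$1"
}

# Demonstrate on a throwaway file holding two illustrative lines.
demo=$(mktemp)
printf '%s\n' 'hardcode_libdir_flag_spec="\${wl}-rpath \${wl}\$libdir"' \
              'runpath_var=LD_RUN_PATH' > "$demo"
strip_libtool_rpath "$demo"
cat "$demo"
rm -f "$demo"
```

After the call, the first line has an empty hardcode_libdir_flag_spec and the second names a variable that nothing sets, so the link line no longer gets an -rpath argument.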

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] [hwloc-devel] hwloc and rpath

2010-06-21 Thread Jeff Squyres
Sorry; I was on a plane while most of this conversation was occurring on Friday.

I see that Samuel converted to use LDADD instead of LIBS.  Cool.  

I still see -rpath being inserted in the final link step for libhwloc.so (SVN 
build using AC 2.65, AM 1.11.1, LT 2.2.6b):

/bin/sh ../libtool  --tag=CC   --mode=link gcc -std=gnu99   -fvisibility=hidden 
-I/usr/include/libxml2   -std=gnu99   -fvisibility=hidden  
-I/users/jsquyres/svn/hwloc/include -Wall -Wunused-parameter -Wundef 
-Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes 
-Wcomment -pedantic -no-undefined  -version-number 0:0:0 -lxml2 -lz -lm
-o libhwloc.la -rpath /home/jsquyres/bogus/lib topology.lo traversal.lo 
topology-synthetic.lo bind.lo cpuset.lo misc.lo topology-xml.lo  
topology-linux.lo   topology-x86.lo  -libverbs

But unless I'm mistaken, libtool then strips it out:

libtool: link: gcc -shared  .libs/topology.o .libs/traversal.o 
.libs/topology-synthetic.o .libs/bind.o .libs/cpuset.o .libs/misc.o 
.libs/topology-xml.o .libs/topology-linux.o .libs/topology-x86.o   -lxml2 -lz 
-lm -libverbs -Wl,-soname -Wl,libhwloc.so.0 -o .libs/libhwloc.so.0.0.0

Does this latest change make it work acceptably for you?



On Jun 18, 2010, at 7:18 PM, Samuel Thibault wrote:

> Samuel Thibault, le Sat 19 Jun 2010 00:03:49 +0200, a écrit :
> > What is the output of gcc -print-search-dirs?
> 
> Ah, no, I misread the configure script, sys_lib_dlsearch_path_spec
> comes from ld.so.conf (that makes sense actually).  Could you post your
> /etc/ld.so.conf (and any file that it could include)?
> 
> Samuel
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] Getting a graphics view for a non graphic system...

2010-06-09 Thread Jeff Squyres
On Jun 6, 2010, at 4:03 PM, Olivier Cessenat wrote:

> What you write is clear to computer scientists, but I failed to figure
> out what it meant. Sorry, it is clear now !

FWIW, there's a section about "output formats" in the hwloc-ls.1 man page.  
It's probably worth adding a sentence in there that the list in the man page 
applies to the filenames; i.e., that the filename determines the output format.

Here's a snippet from the man page:

OUTPUT FORMATS
   -      Send a text summary to stdout.

   /dev/stdout
          Send a text summary to stdout.  It is effectively the same as
          specifying "-".

   .txt   If the filename ends in ".txt", lstopo outputs an ASCII art
          representation of the map.

   -.txt  If the entire filename is "-.txt", lstopo outputs the same ASCII
          art representation as other ".txt" filenames, but with two
          exceptions: 1) the output is sent to stdout, and 2) if colors are
          supported on the terminal, the ASCII art will be colorized.

   .fig   If the filename ends in ".fig", lstopo outputs a representation
          of the map that can be loaded in Xfig.

   .pdf   If the filename ends in ".pdf" and lstopo was compiled with the
          proper support, lstopo outputs a PDF representation of the map.

   .ps    If the filename ends in ".ps" and lstopo was compiled with the
          proper support, lstopo outputs a Postscript representation of
          the map.

   .png   If the filename ends in ".png" and lstopo was compiled with the
          proper support, lstopo outputs a PNG representation of the map.

   .svg   If the filename ends in ".svg" and lstopo was compiled with the
          proper support, lstopo outputs an SVG representation of the map.

   .xml   If the filename ends in ".xml" and lstopo was compiled with the
          proper support, lstopo outputs an XML representation of the map.
          It may be reused later, even on another machine, with lstopo
          --xml, the HWLOC_XMLFILE environment variable, or the
          hwloc_topology_set_xml() function.

   See the output of "lstopo --help" for a specific list of what graphical
   output formats are supported in your hwloc installation.
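
Put as code, the filename rule reads roughly like the sketch below.  This only mirrors the man page text above and is not lstopo's actual implementation; the function name lstopo_format_for is made up, and the "unknown" fallback is a placeholder because the man page does not say what happens for other names:

```shell
# Hypothetical helper: print the output format that the man page rules above
# associate with a given lstopo output filename.
lstopo_format_for() {
    case "$1" in
        -|/dev/stdout) echo "text summary" ;;
        -.txt)         echo "ASCII art on stdout" ;;
        *.txt)         echo "ASCII art" ;;
        *.fig)         echo "Xfig" ;;
        *.pdf)         echo "PDF" ;;
        *.ps)          echo "Postscript" ;;
        *.png)         echo "PNG" ;;
        *.svg)         echo "SVG" ;;
        *.xml)         echo "XML" ;;
        *)             echo "unknown" ;;   # placeholder, not documented above
    esac
}
```

So on a machine without a display, running `lstopo topo.png` or `lstopo topo.xml` is enough to pick the format; no separate format option is involved in the rules quoted here.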


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc on systems with more than 64 cpus?

2010-05-17 Thread Jeff Squyres
On May 17, 2010, at 11:41 AM, Jirka Hladky wrote:

> BTW, is there any time-plan for hwloc 1.0 to be released?

There were some trivial changes since rc6; I have one more trivial change to 
make today and then we're probably good to go.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc on systems with more than 64 cpus?

2010-05-14 Thread Jeff Squyres
I believe that Brice / Samuel (the two main developers) have tested hwloc on an 
old Altix 4700 with 256 Itanium cores.  

I don't have their exact results, and I don't see them on IM right now, so I 
don't know if they're around today or not...


On May 14, 2010, at 8:57 AM, Jirka Hladky wrote:

> Hello,
> 
> I have tested hwloc on several systems and I was very impressed with results.
> It's a great tool!
> 
> The biggest box I have tested it on had 64 CPUs. (32 cores + hyper threading
> enabled).
> 
> I wonder if somebody has tested it on a box with more than 64 CPUs. If so, can
> you please share your results?
> 
> Thanks a lot!
> Jirka


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[hwloc-users] 1.0rc2

2010-04-26 Thread Jeff Squyres
A bunch of changes/fixes have gone in, and I figured out what was causing 
Badness when trying to embed hwloc into Open MPI.  So I took the liberty of 
rolling 1.0rc2.  Please give it a whirl and let us know how it goes:

http://www.open-mpi.org/~jsquyres/www.open-mpi.org/software/hwloc/v1.0/

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] No caches or hierarchy on RHEL 4.7 or 4.8

2010-01-28 Thread Jeff Squyres
I know that the RHEL 4 kernels are 2.6.9 -- they're really ancient.  Most of 
the topology stuff didn't come into the kernel until 2.6.15 or so.
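
A quick way to see which side of that line a machine falls on is to compare the running kernel version against it.  A sketch only: 2.6.15 is the approximate figure from above rather than an exact cut-off, the helper name version_ge is made up, and `sort -V` assumes GNU coreutils:

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kernel=$(uname -r)
kernel=${kernel%%-*}   # drop the distribution suffix after the first "-"
if version_ge "$kernel" 2.6.15; then
    echo "kernel $kernel: sysfs topology information is probably available"
else
    echo "kernel $kernel: expect little or no topology information in sysfs"
fi
```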


On Jan 28, 2010, at 1:40 PM, Brice Goglin wrote:

> Dan Eaton wrote:
>> Yes, the cpuid version works marvelously. The only thing it misses is 
>> memory/NUMA node (is it using libnuma?).
>> 
> 
> libnuma reads NUMA info exactly like hwloc does: from sysfs, i.e. it
> depends on the kernel. I don't know if we could try to read SRAT tables
> from userspace when the kernel exports wrong NUMA info...
> 
> Brice


-- 
Jeff Squyres
jsquy...@cisco.com




[hwloc-users] Hardware Locality (hwloc) v0.9.3 released

2009-12-01 Thread Jeff Squyres
The Hardware Locality (hwloc) team is pleased to announce the release  
of v0.9.3:


  http://www.open-mpi.org/projects/hwloc/
  (mirrors will update shortly)

hwloc provides command line tools and a C API to obtain the  
hierarchical map of key computing elements, such as: NUMA memory  
nodes, shared caches, processor sockets, processor cores, and  
processor "threads".  hwloc also gathers various attributes such as  
cache and memory information, and is portable across a variety of  
different operating systems and platforms.


hwloc v0.9.3 is a bug fix release.  The following is a summary of  
changes as compared to v0.9.2:


* Fix autogen.sh to work with Autoconf 2.63.
* Fix various crashes in particular conditions:
  - xml files with root attributes
  - offline CPUs
  - partial sysfs support
  - unparseable /proc/cpuinfo
  - ignoring NUMA level while Misc level have been generated
* Tweak documentation a bit
* Do not require the pthread library for binding the current thread on
  Linux
* Do not erroneously consider the sched_setaffinity prototype is the
  old version when there is actually none.
* Fix _syscall3 compilation on archs for which we do not have the
  sched_setaffinity system call number.
* Fix AIX binding.
* Fix libraries dependencies: now only lstopo depends on libtermcap, fix
  binutils-gold link
* Have make check always build and run hwloc-hello.c
* Do not limit size of a cpuset.

*** Note that the hwloc project represents the merger of the  
libtopology project from INRIA and the Portable Linux Processor  
Affinity (PLPA) sub-project from Open MPI.  *Both of these prior  
projects are now deprecated.*


--
Jeff Squyres
jsquy...@cisco.com



[hwloc-users] libtopology tarballs posted

2009-09-14 Thread Jeff Squyres
FYI: Because the libtopology web site will disappear someday, and  
because we haven't released a version of hwloc yet, I posted the  
libtopology v0.9 tarballs here:


http://www.open-mpi.org/software/hwloc/v0.9/

--
Jeff Squyres
jsquy...@cisco.com