Hi all,

I have some code that is unhappy.  I suspect I'm running into an OS-level
resource limit.

I'm working with an Intel Xeon Phi machine running CentOS that reports 256
cores. It is built around the Intel Xeon Phi 7210, a 64-core processor with
four hardware threads per core, which would account for the 256. I compiled
Racket from source, but I don't think that makes a difference here.

I've parallelized some code using places, and it seems to be OK when (<=
num-places 96). Internally, the queen bee process uses channels (synchronous
and asynchronous in places) as well as two semaphore-protected queues. I have
around N:M places-to-threads in the queen, and my code is not *intentionally*
non-deterministic.

The worker bees process messages in-and-out on their place-channels. Each
worker holds 2 connections to two databases. (I'm wondering if compiled
.zos of libraries are counting towards the number of open files each place
holds on to...)

When I run with 64 places, things are OK. When I run with 96 places, things
seem OK (code runs to completion). When I run with 128 places, things are
not OK.

My current best guess is that I'm hitting the maximum number of allowed open
file descriptors. I don't have root on the system, so I can't easily change
this, but I thought I'd throw this to the list and see if anyone has any
further thoughts as to what I might look for. Given that it takes a while
to spin up all 128 of the places, I suspect things look like they're
running fine until enough of the places have started; at that point I run
out of descriptors, and then all kinds of badness begins.

[mjadud@phi data]$ ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 385394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

[mjadud@phi data]$ ulimit -Sn
1024
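
One thing that may be worth trying without root: the soft limit above (1024)
is well below the hard limit (4096), and an unprivileged process is allowed
to raise its own soft limit up to the hard limit. A minimal sketch, assuming
bash and that racket is launched from the same shell so that it inherits the
limit (the `racket main.rkt` line is a hypothetical entry point, not my
actual command):

```shell
# Raise the soft open-files limit to the hard limit for this shell.
# Child processes started from this shell inherit the new limit.
hard_limit=$(ulimit -Hn)
ulimit -Sn "$hard_limit"
echo "soft open-files limit: $(ulimit -Sn)"

# Hypothetical entry point; start the program from this same shell:
# racket main.rkt
```

This only changes the limit for the current shell session, so it would need
to be repeated (or put in a wrapper script) for each run.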

When running with 96 places:
...

[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
873

# This is a steady state.
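
A side note on these counts: `ls -l` prints a leading "total" line, so the
numbers are one higher than the actual descriptor count, and `pidof` can
return more than one PID. A small sketch of a per-process count, assuming
Linux's /proc layout (the function name `count_fds` is only illustrative):

```shell
# Count open descriptors for one PID by listing /proc/<pid>/fd.
# Plain `ls` (no -l) avoids the extra "total" line.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Report each racket process separately; pidof may return several PIDs.
for pid in $(pidof racket 2>/dev/null); do
    echo "$pid: $(count_fds "$pid") open descriptors"
done

# Sanity check against the current shell.
echo "this shell ($$): $(count_fds $$)"
```

Running `ls -l` on the same directory shows the link target of each
descriptor, which should indicate whether they are .zo files, sockets, or
pipes.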


Keeping an eye on things when running with 128 places:
...

[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
983
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
999

# <-- somewhere around here, things went badly.
# It never achieves a steady state.

[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
1006
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
1011
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
1014
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
1019
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l
1014
[mjadud@phi data]$ ls -l /proc/$(pidof racket)/fd | wc -l

Thoughts appreciated,
Matt

--- Machine ---
cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

------ When Things Die ------

open-input-file: cannot open input file
  path: /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/data/timezone.xml
  system error: Too many open files; errno=24
  context...:
   /usr/netapp/faculty/mjadud/racket-src/racket/collects/racket/private/kw-file.rkt:102:2: call-with-input-file*61
   "/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/timezone.rkt": [running body]
   temp37_0
   for-loop
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   ...

instantiate: unknown module
  module name: #<resolved-module-path:(submod 'typed-racket/private/type-contract.rkt[8709188] predicates)>
  context...:
   namespace-module-instantiate!96
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   ...


standard-module-name-resolver: collection not found
  for module path: typed-racket/utils/hash-contract
  collection: "typed-racket/utils"
  in collection directories:
   /usr/netapp/faculty/mjadud/.racket/development/collects
   /usr/netapp/faculty/mjadud/racket-src/racket/collects
  context...:
open-input-file: cannot open input file
  path: /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-core/cldr/compiled/core_rkt.zo
  system error: Too many open files; errno=24   show-collection-err
   standard-module-name-resolver
   namespace-module-instantiate!96
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   perform-require!78
   for-loop
   [repeats 1 more time]
   add-lifted-require!
   do-local-lift-to-module48
   syntax-local-lift-require
   /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/typed-racket-lib/typed-racket/utils/redirect-contract.rkt:32:2: redirect

  context...:
   default-load-handler
   [repeats 1 more time]
   standard-module-name-resolver
   apply-transformer-in-context
   namespace-module-instantiate!96
   for-loop
   apply-transformer52
   ...
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   ...
