On 11/4/2018 12:18 PM, Matt Jadud wrote:
> I have some code that is unhappy. I suspect I'm running into an
> OS-level resource limit.
>
> I'm working with an Intel Xeon Phi machine running CentOS that
> reports 256 cores. It is built around the Xeon Phi 7210, a 64-core
> chip with four hardware threads per core, which accounts for the
> 256 logical CPUs. I compiled Racket from source, but I don't think
> that makes a difference here.
>
> I've parallelized some code using places, and things seem to be OK
> when (< num-places 96). Internally, the queen bee process uses place
> channels (both synchronous and asynchronous) as well as two
> semaphore-protected queues. The queen has roughly an N:M
> places-to-threads ratio, and my code is not *intentionally*
> non-deterministic.
Are you using in-process places or distributed places? In-process
places are just OS threads within the same process. Distributed places
can be launched as separate processes, and each process then has its
own set of file descriptors (and its own descriptor limit).
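For reference, here is a minimal sketch (not your code) of an
in-process place; everything in it runs inside one OS process and
shares a single descriptor table:

    #lang racket
    (require racket/place)

    ;; Start a worker place. The body runs on another OS thread in
    ;; this same process, so any files it opens count against this
    ;; process's descriptor limit.
    (define (start-worker)
      (place ch
        (let loop ()
          (define work (place-channel-get ch))
          (place-channel-put ch (list 'done work))
          (loop))))

    (module+ main
      (define p (start-worker))
      (place-channel-put p 42)
      (place-channel-get p))  ; => '(done 42)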
> The worker bees process messages in and out on their place-channels.
> Each worker holds a connection to each of two databases.
What DBMS(es)? An in-process DBMS like SQLite holds its database file
open directly, so each connection costs file descriptors. A
client/server DBMS uses a network socket per connection instead; note
that on Linux a socket is itself a file descriptor, so those
connections count against the open-files limit too.
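For illustration (the server, user, and database names below are made
up), here is how the two cases look with Racket's db library; either
way, each connection ends up holding a descriptor:

    #lang racket
    (require db)

    ;; File-backed: the connection holds the database file open.
    (define local-conn
      (sqlite3-connect #:database "cache.db" #:mode 'create))

    ;; Socket-backed: the connection holds a TCP socket, which is
    ;; also a file descriptor on Linux.
    (define remote-conn
      (postgresql-connect #:server "db.example.org"
                          #:user "mjadud"
                          #:database "results"))

    ;; Either way, disconnect explicitly when a worker is done,
    ;; rather than waiting for the GC to finalize the connection.
    (disconnect local-conn)
    (disconnect remote-conn)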
> (I'm wondering if compiled .zos of libraries are counting towards
> the number of open files each place holds on to...)
IIRC, bytecode files are memory-mapped, and that requires keeping the
file open (though on Linux, at least, a descriptor can usually be
closed once the mapping is established). But even if every file were
mapped into every place, you'd need a lot of code files to exhaust 4K
descriptors ... and if the mapping is done smartly, only one
descriptor per file should be needed.
> When I run with 64 places, things are OK. When I run with 96 places,
> things seem OK (the code runs to completion). When I run with 128
> places, things are not OK.
>
> My current best guess is that I'm hitting the maximum number of
> allowed open file descriptors. I don't have root on the system, so I
> can't easily change this, but I thought I'd throw this to the list
> and see if anyone has further thoughts on what I might look for.
> Given that it takes a while to spin up all 128 places, I suspect
> everything looks fine until enough of them have started, at which
> point I run out of descriptors and all kinds of badness begins.
It does appear that you're running out of file descriptors. One thing
to check first: the limit you're hitting is probably the *soft* limit
of 1024 (see your ulimit -Sn output below), not the 4K hard limit, and
you can raise the soft limit up to the hard limit without root by
running "ulimit -n 4096" before starting Racket. Even so, if files are
being closed properly, it's hard to see how a single process with many
places could use thousands of descriptors.
Place channels are internal to Racket, and AFAIK they don't use file
descriptors. Distributed place channels use TCP connections, and a
socket is also a file descriptor on Linux, so those would count
against the limit.
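To confirm the diagnosis, you could have the queen (or any single
place) report the process's descriptor count as workers spin up. A
quick sketch, assuming Linux's /proc filesystem:

    #lang racket
    ;; Linux-only: each entry in /proc/self/fd is one open descriptor
    ;; in this process (files, sockets, and pipes alike).
    (define (open-fd-count)
      (length (directory-list "/proc/self/fd")))

    (printf "open fds: ~a~n" (open-fd-count))

Watching that number climb toward 1024 as the places start would
settle the question.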
Without a whole lot more information, the only other thought that
occurs to me is to have each place force a major GC after it finishes
a work unit. If something is not being closed properly by the code,
then it might at least be cleaned up by the GC.
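Here is a sketch of that idea (process-work-unit is a hypothetical
stand-in for your actual worker body), combined with a fresh custodian
per work unit so that anything opened during the unit gets force-closed
even if the GC never reaches it:

    #lang racket
    ;; Run one unit of work under a fresh custodian, then shut the
    ;; custodian down and force a major collection.
    (define (run-unit-with-cleanup process-work-unit work)
      (define cust (make-custodian))
      (begin0
          (parameterize ([current-custodian cust])
            (process-work-unit work))
        ;; Closes every port/socket opened under cust, leaked or not.
        (custodian-shutdown-all cust)
        ;; collect-garbage performs a major collection, finalizing
        ;; anything unreachable that still holds a descriptor.
        (collect-garbage)))

One caveat: if a worker's two database connections are meant to live
across work units, open them outside this custodian, or the shutdown
will close them too.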
George
[mjadud@phi data]$ ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
*open files                     (-n) 4096*
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 385394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[mjadud@phi data]$ ulimit -Sn
1024
------ When Things Die ------
open-input-file: cannot open input file
path:
/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/data/timezone.xml
system error: Too many open files; errno=24
context...:
/usr/netapp/faculty/mjadud/racket-src/racket/collects/racket/private/kw-file.rkt:102:2:
call-with-input-file*61
"/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/timezone.rkt":
[running body]
temp37_0
for-loop
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
...
instantiate: unknown module
module name: #<resolved-module-path:(submod
'typed-racket/private/type-contract.rkt[8709188] predicates)>
context...:
namespace-module-instantiate!96
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
...
standard-module-name-resolver: collection not found
for module path: typed-racket/utils/hash-contract
collection: "typed-racket/utils"
in collection directories:
/usr/netapp/faculty/mjadud/.racket/development/collects
/usr/netapp/faculty/mjadud/racket-src/racket/collects
context...:
open-input-file: cannot open input file
path:
/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-core/cldr/compiled/core_rkt.zo
system error: Too many open files; errno=24
show-collection-err
standard-module-name-resolver
namespace-module-instantiate!96
for-loop
[repeats 1 more time]
run-module-instance!125
perform-require!78
for-loop
[repeats 1 more time]
add-lifted-require!
do-local-lift-to-module48
syntax-local-lift-require
/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/typed-racket-lib/typed-racket/utils/redirect-contract.rkt:32:2:
redirect
context...:
default-load-handler
[repeats 1 more time]
standard-module-name-resolver
apply-transformer-in-context
namespace-module-instantiate!96
for-loop
apply-transformer52
...
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
[repeats 1 more time]
run-module-instance!125
for-loop
...