On 11/4/2018 12:18 PM, Matt Jadud wrote:
I have some code that is unhappy.  I suspect I'm running into an OS-level resource limit.

I'm working with an Intel Phi machine running CentOS that reports 256 cores. It is built using the Intel Xeon Phi 7210, a single 64-core processor with four hardware threads per core, which accounts for the 256 logical cores. I compiled Racket from source, but I don't think that makes a difference here.

I've parallelized some code using places, and seem to be OK when (< num-places 96). I've used channels (synchronous and asynchronous in places, as well as two semaphore-protected queues) internally for the queen bee process. I have around N:M places-to-threads in the queen, and my code is not *intentionally* non-deterministic.

Are you using in-process places or distributed places? In-process places are just OS threads in the same process, so they all share that one process's file descriptors and its open-files limit. Distributed places can be launched in/as separate processes, but then each process would have its own set of file descriptors.


The worker bees process messages in-and-out on their place-channels. Each worker holds 2 connections to two databases.

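What DBMS(es)? An in-process DBMS like SQLite opens its database files directly, while a client/server DBMS uses network connections. Either way each connection holds at least one descriptor: on Unix-like systems sockets count against the open-files limit just like regular files.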


(I'm wondering if compiled .zos of libraries are counting towards the number of open files each place holds on to...)

IIRC, bytecode files are memory mapped. (POSIX mmap does not require the descriptor to stay open once the mapping exists, so this may cost no descriptors at all.) But even if every file is mapped into every place and each mapping keeps its file open, you'd need a lot of code files to exhaust the descriptor limit, and if it is being done smartly, there would only be 1 descriptor needed per file.


When I run with 64 places, things are OK. When I run with 96 places, things seem OK (code runs to completion). When I run with 128 places, things are not OK.

My current best guess is that I'm running into a max number of allowed open file descriptors. I don't have root on the system, so I can't easily change this, but I thought I'd throw this to the list and see if anyone has any further thoughts as to what I might look for. Given that it takes a while to spin up all 128 of the places, I suspect things look like they're running fine until enough of the places spin up; then I run out of descriptors and all kinds of badness begins.
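A minimal shell sketch for checking the limits reported in the ulimit output in this message. The "Too many open files" error (errno 24) is triggered by the soft limit (1024 here), not the hard limit (4096). On Linux an unprivileged user may raise the soft limit up to the hard limit without root, and programs started from that shell afterward inherit the new value. This assumes bash or another POSIX shell whose ulimit builtin accepts -S/-H; it does not help if the program genuinely needs more than the hard limit.

```shell
#!/bin/sh
# Report the per-process limits on open file descriptors for this shell.
# The soft limit is what actually triggers "Too many open files" (errno 24);
# the hard limit is the ceiling an unprivileged user may raise it to.
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "open files: soft=$soft hard=$hard"

# Raise the soft limit to the hard limit (no root needed).
# Only this shell and the processes it starts afterward are affected,
# so racket must be launched from this same shell session.
if [ "$soft" != "$hard" ]; then
  ulimit -S -n "$hard" || echo "could not raise soft limit" >&2
fi
echo "open files soft limit is now: $(ulimit -S -n)"
```

Running racket from the same session afterward would give it up to 4096 descriptors on this machine instead of 1024.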

It does appear that you're running out of file descriptors, although even if you have a single process with many places, it's hard to see how you could be using all of them (1024 under the soft limit shown in your ulimit output, 4096 at most under the hard limit) if files are getting closed properly. Place channels are internal to Racket, and AFAIK they don't use file descriptors. Distributed place channels use TCP connections; you can run out of ports, and each open socket also consumes a file descriptor, so those would count toward the limit as well.
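Because in-process places are threads of one Racket process, they all draw on that process's single descriptor table, so counting the process's descriptors shows how close it is to the limit and what kind of descriptor dominates (sockets, pipes, .zo files, data files). A minimal Linux-only sketch, assuming /proc is mounted; the helper name fd_summary is hypothetical, and you would pass it the PID of the running racket process (e.g. as found with pgrep racket).

```shell
#!/bin/sh
# Hypothetical helper: summarize the open descriptors of a process (Linux /proc only).
# Usage: fd_summary PID
fd_summary() {
  pid=$1
  # Total number of descriptors currently open in the process.
  total=$(ls /proc/"$pid"/fd | wc -l | tr -d ' ')
  echo "process $pid: $total open descriptors"
  # Group by target: sockets and pipes collapse to their type ("socket", "pipe");
  # regular files collapse to their directory, e.g. .../compiled for .zo files.
  for fd in /proc/"$pid"/fd/*; do
    readlink "$fd" 2>/dev/null
  done | sed -e 's/:\[.*//' -e 's#/[^/]*$##' | sort | uniq -c | sort -rn | head -n 10
}

# Example: inspect the current shell; replace $$ with the racket PID.
fd_summary $$
```

Sampling this every few seconds while the places spin up would show whether the count climbs steadily (a leak) or jumps while modules are being loaded.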

Without a whole lot more information, the only thought that occurs is to have each place force a major GC, e.g. (collect-garbage 'major), after it finishes a work unit. If something is not being closed properly by the code, then it might be cleaned up by the GC.

George


[mjadud@phi data]$ ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
*open files                     (-n) 4096*
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 385394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[mjadud@phi data]$ ulimit -Sn
1024


------ When Things Die ------

open-input-file: cannot open input file
  path: /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/data/timezone.xml
  system error: Too many open files; errno=24
  context...:
   /usr/netapp/faculty/mjadud/racket-src/racket/collects/racket/private/kw-file.rkt:102:2: call-with-input-file*61
   "/usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-bcp47/cldr/bcp47/timezone.rkt": [running body]
   temp37_0
   for-loop
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   ...

instantiate: unknown module
  module name: #<resolved-module-path:(submod 'typed-racket/private/type-contract.rkt[8709188] predicates)>
  context...:
   namespace-module-instantiate!96
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   ...


standard-module-name-resolver: collection not found
  for module path: typed-racket/utils/hash-contract
  collection: "typed-racket/utils"
  in collection directories:
   /usr/netapp/faculty/mjadud/.racket/development/collects
   /usr/netapp/faculty/mjadud/racket-src/racket/collects
  context...:
open-input-file: cannot open input file
  path: /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/cldr-core/cldr/compiled/core_rkt.zo
  system error: Too many open files; errno=24
   show-collection-err
   standard-module-name-resolver
   namespace-module-instantiate!96
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   perform-require!78
   for-loop
   [repeats 1 more time]
   add-lifted-require!
   do-local-lift-to-module48
   syntax-local-lift-require
   /usr/netapp/faculty/mjadud/racket-src/racket/share/pkgs/typed-racket-lib/typed-racket/utils/redirect-contract.rkt:32:2: redirect

  context...:
   default-load-handler
   [repeats 1 more time]
   standard-module-name-resolver
   apply-transformer-in-context
   namespace-module-instantiate!96
   for-loop
   apply-transformer52
   ...
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   [repeats 1 more time]
   run-module-instance!125
   for-loop
   ...


--
You received this message because you are subscribed to the Google Groups "Racket Users" group.