Dear Compile Farm Users,
We are happy to announce the immediate availability of two
NVIDIA DGX Spark GB10 machines, each with 20 CPU cores and 128GB
of memory; they are cfarm107:2107 and cfarm108:2108. The
announcement is here:
https://portal.cfarm.net/news/56
This email is to provide additional information about how these
machines have been configured, how we intend to adapt them, and
how they should (and should not) be used. TL;DR: don't abuse them.
These shared machines are intended for development and testing
of software within relevant ecosystems, and are not to be used
as "free" computing devices for model training or inference
for personal gain. We will monitor usage and advise accordingly.
Our top priority is to ensure that everyone has fair access to
these unique resources. Note that privacy is not guaranteed;
please do not use any Compile Farm machines for sensitive tasks.
We intend to allow users to make full use of the software and
hardware, including Docker containers, and are evaluating the
potential for an unprivileged ("rootless") Docker configuration
that retains GPU acceleration while preserving system integrity
and matching the original out-of-box experience as closely as
possible. Unprivileged Docker use has not been configured yet.
We will not be publicly forwarding unprivileged ports; please
use ssh -L to access any local services you run, and take care
not to store or process any sensitive information on these
machines. Be aware that conflicts may occur if multiple users
run services on the same "default" ports; configure your
services to use other ports accordingly.
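As a rough sketch of one way to avoid "default" port clashes,
the Python below asks the kernel for an unused port and builds a
matching ssh -L command. The function names, the user "alice"
and the full host name cfarm107.cfarm.net are assumptions made
for illustration only; substitute your own login and host. The
port is only known to be free at the moment it is picked, so
another user could still claim it before your service starts.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the kernel for a TCP port that is unused right now on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def ssh_forward_command(user: str, farm_host: str, ssh_port: int,
                        remote_port: int, local_port: int) -> str:
    """Build an ssh command forwarding local_port to a service that
    listens on 127.0.0.1:remote_port on the farm machine."""
    return (f"ssh -p {ssh_port} -N "
            f"-L {local_port}:127.0.0.1:{remote_port} {user}@{farm_host}")
```

Run pick_free_port() on the farm machine, start your service on
127.0.0.1 at that port, then run the string returned by, say,
ssh_forward_command("alice", "cfarm107.cfarm.net", 2107, port,
port) on your own workstation.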
The official platform User Guide is here:
https://docs.nvidia.com/dgx/dgx-spark/index.html
Given the rapidly evolving ecosystem, we will be applying system
updates regularly with little or no notice depending on the
nature of the updates. If you plan to run jobs that span
multiple days or otherwise need advance notice, please write
to the appropriate mailing list so that we can pause updates.
It is possible to connect two of these machines together via a
high speed data cable. We will not be doing so at this time.
One open question is whether and how to store large models in a
shared location, to avoid duplication and to keep them from
filling up the disk. If
we store popular models in /opt/cfarm/ (for example) then we
will not know when models are no longer used, and they will stay
around longer than needed. High capacity network storage is not
available at this time; bandwidth is cheaper than storage.
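One possible way to spot models that are no longer used, offered
only as a sketch and not as something the admins have adopted,
is to look at file access times. The function name stale_files
and the 30-day threshold below are hypothetical. Access times
are only meaningful if the filesystem records them: a noatime
mount never updates them, and the common relatime default
updates them at most about once a day.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional, Tuple


def stale_files(root: str, max_idle_days: float,
                now: Optional[float] = None) -> Iterator[Tuple[Path, int, float]]:
    """Yield (path, size_bytes, idle_days) for files under `root` whose
    last access time is at least `max_idle_days` days in the past."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - st.st_atime) / 86400
            if idle_days >= max_idle_days:
                yield path, st.st_size, idle_days
```

For example, list(stale_files("/opt/cfarm", 30)) would return
the files under that directory that have not been read for at
least 30 days, along with their sizes.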
Another open question is how the OOM killer works on a system
with unified (CPU+GPU) memory, and how to prioritize stability.
Please report any issues to the mailing list(s) or privately. We
welcome discussion about how to best configure these machines so
that they are of maximum use to all users.
Sincerely,
Compile Farm Admins
_______________________________________________
cfarm-users mailing list
https://lists.tetaneutral.net/listinfo/cfarm-users