> The full weight files weigh in at 308033580802 bytes (286.88 GiB).
> The slim weight files, which usually means precision is reduced to
> float16 (sometimes float8), weigh in at 41112854242 bytes (38.29
> GiB).
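As a sanity check, the GiB figures follow directly from the byte counts (1 GiB = 2**30 bytes):

```python
# Sanity-check the GiB figures quoted above (1 GiB = 2**30 bytes).
FULL_BYTES = 308033580802   # full weight files
SLIM_BYTES = 41112854242    # slim (reduced-precision) weight files

def to_gib(n_bytes):
    return n_bytes / 2**30

print(f"{to_gib(FULL_BYTES):.2f} GiB")  # 286.88 GiB
print(f"{to_gib(SLIM_BYTES):.2f} GiB")  # 38.29 GiB
```

Worth noting the ratio is about 7.5x, more than the 2x you would expect from float32 to float16 alone, which fits the caveat below that "full" and "slim" may mean something else here.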
Just a note that I might be wrong here about what full and slim mean.

> Traditionally the entire model is loaded into VRAM to evaluate it,
> although it can also be streamed in and out or distributed across
> multiple machines with some hacks. There is additional overhead
> beyond just the weights, and significantly more overhead if the
> model is further being trained for a specific task.

Can also add that people have been training models on low-end hardware by tracing and training only a subset of the parameters at once; traditionally all are trained at once. Systems also support a form of checkpointing that discards and regenerates the derivatives when needed, as I've mentioned in a spamlog somewhere.
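The subset-training idea can be sketched as a plain SGD step that updates only parameters marked trainable; frozen parameters receive no update, so optimizer state and gradients for them never need to be held in memory. The names and values below are invented for illustration:

```python
# Toy sketch of training only a subset of the parameters at once.
# Frozen parameters (trainable=False) are passed through untouched,
# so no gradient or optimizer state is needed for them.

def sgd_step(weights, grads, trainable, lr=0.5):
    """One SGD step that updates only the trainable parameters."""
    return [w - lr * g if t else w
            for w, g, t in zip(weights, grads, trainable)]

weights = [1.0, 2.0, 3.0]
grads = [2.0, 2.0, 2.0]
trainable = [True, False, True]  # middle parameter is frozen
print(sgd_step(weights, grads, trainable))  # [0.0, 2.0, 2.0]
```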

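The checkpointing trick mentioned above can be illustrated on a scalar chain x_{i+1} = w_i * x_i with loss L = x_n: only every `stride`-th activation is kept, and the rest are recomputed from the nearest checkpoint during the backward pass, trading compute for memory. This is an illustrative sketch with invented names, not any particular framework's API:

```python
import math

# Toy activation checkpointing for the chain x_{i+1} = w_i * x_i.
# Forward keeps only every `stride`-th activation; backward
# recomputes each segment from its checkpoint before backprop.

def forward(ws, x0, stride):
    """Run the chain, storing activations only at checkpoints."""
    ckpts = {0: x0}
    x = x0
    for i, w in enumerate(ws):
        x = w * x
        if (i + 1) % stride == 0:
            ckpts[i + 1] = x
    return x, ckpts

def backward(ws, ckpts, stride):
    """Compute dL/dw_i for L = x_n, recomputing each segment."""
    n = len(ws)
    grads = [0.0] * n
    g = 1.0  # dL/dx_n
    for s in range(math.ceil(n / stride) - 1, -1, -1):
        lo, hi = s * stride, min(s * stride + stride, n)
        # Recompute activations x_lo..x_hi from the checkpoint at lo.
        xs = [ckpts[lo]]
        for i in range(lo, hi):
            xs.append(ws[i] * xs[-1])
        # Backprop through the segment.
        for i in range(hi - 1, lo - 1, -1):
            grads[i] = xs[i - lo] * g  # dL/dw_i = x_i * dL/dx_{i+1}
            g = ws[i] * g              # dL/dx_i = w_i * dL/dx_{i+1}
    return grads

ws = [2.0, 3.0, 4.0]
loss, ckpts = forward(ws, 1.0, stride=2)
print(loss)                           # 24.0
print(backward(ws, ckpts, stride=2))  # [12.0, 8.0, 6.0]
```

With stride equal to 1 this degenerates to ordinary backprop (everything stored); larger strides store fewer activations at the cost of redoing part of the forward pass.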