Sadly that's way beyond my capabilities, but what I take away from this is
that development is continuing so quickly that surely there will be
better-quality models for smaller computers available in the coming months. =)
On Thu, 6 Apr 2023, Undescribed Horrific Abuse, One Victim & Survivor of Many
wrote:
people have been quantizing models using
https://github.com/qwopqwop200/GPTQ-for-LLaMa and uploading the models
to huggingface.co
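
(for intuition, here is a rough sketch of the simplest form of weight
quantization, plain per-channel 4-bit round-to-nearest in pytorch -- GPTQ
itself goes further and compensates the rounding error column by column
using second-order information; the helper names below are my own, not
from that repo)

import torch

def quantize_rtn_4bit(weight: torch.Tensor):
    # symmetric signed 4-bit range is [-8, 7]; one scale per output row
    qmax = 7
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(weight / scale), -8, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, s = quantize_rtn_4bit(w)
    print("mean abs quantization error:",
          (dequantize(q, s) - w).abs().mean().item())
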
they can be pruned smaller using sparsegpt, which has some forks for
llama, but a little dev work is needed to do the pruning in a way that
is useful: presently the lost weights are just set to zero. it would
make sense to alter the algorithm so that entire matrix columns and
rows can be excised (see https://github.com/EIDOSLAB/simplify for
ideas), or to use a purpose-selected dataset and severely increase the
sparsification (per the lottery-ticket literature,
https://scholar.google.com/scholar?as_ylo=2023&q=lottery+tickets
pruning, even on random data, may actually be more effective than
normal methods for training models)
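
(to make the row/column point concrete, here is a toy sketch in plain
pytorch -- zeroing weights leaves the tensors the same size, whereas
slicing whole output rows out of one linear layer and the matching input
columns out of the next actually shrinks the matmuls, which is roughly
what simplify automates; the function below is just my own illustration)

import torch
import torch.nn as nn

def excise_hidden_units(fc1: nn.Linear, fc2: nn.Linear, keep: torch.Tensor):
    # structurally remove hidden units from an fc1 -> fc2 pair;
    # `keep` is a 1-D tensor of hidden-unit indices to retain
    new_fc1 = nn.Linear(fc1.in_features, len(keep), bias=fc1.bias is not None)
    new_fc1.weight.data = fc1.weight.data[keep]        # drop output rows
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data[keep]
    new_fc2 = nn.Linear(len(keep), fc2.out_features, bias=fc2.bias is not None)
    new_fc2.weight.data = fc2.weight.data[:, keep]     # drop matching input columns
    if fc2.bias is not None:
        new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2

if __name__ == "__main__":
    fc1, fc2 = nn.Linear(512, 2048), nn.Linear(2048, 512)
    # keep the 1024 hidden units with the largest outgoing weight norm
    keep = fc2.weight.norm(dim=0).topk(1024).indices.sort().values
    small1, small2 = excise_hidden_units(fc1, fc2, keep)
    x = torch.randn(1, 512)
    print(small2(small1(x).relu()).shape)   # torch.Size([1, 512])
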
the newer fad is https://github.com/FMInference/FlexGen which i don't
believe has been ported to llama yet, but it isn't complex; notably it
applies 10% sparsity in attention, though i don't believe it prunes
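
(a generic illustration of that kind of attention sparsity -- keep only
the top fraction of scores per query and mask out the rest; this is not
FlexGen's actual code, just the idea)

import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep_frac=0.10):
    # for each query, keep only the largest `keep_frac` of key positions
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    topk = max(1, int(scores.size(-1) * keep_frac))
    thresh = scores.topk(topk, dim=-1).values[..., -1:]   # per-query cutoff
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, head_dim)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    print(topk_sparse_attention(q, k, v).shape)
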
and the latest version of pytorch has some built-in accelerated,
memory-efficient attention algorithms that could likely be almost
drop-in replacements for huggingface's manual attention, mostly useful
when training with longer contexts:
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
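
(for example, the manual attention pattern most huggingface modules use
maps onto the fused call like this -- actually wiring it into a real
model would still mean adapting each module's mask handling)

import torch
import torch.nn.functional as F

# requires pytorch >= 2.0
q = torch.randn(1, 8, 1024, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# the "manual" pattern: materialize the full score matrix, then softmax
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
manual = F.softmax(scores, dim=-1) @ v

# the fused kernel; is_causal=True would add the usual lower-triangular
# decoder mask without ever materializing the score matrix
fused = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print("max abs difference:", (manual - fused).abs().max().item())
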