karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > karl3@writeme.com wrote:
> > there's some interest in 'downloading only top k items' -- this involves
> > looking at the layer algebra and coming up with ways to identify
> > low-contributing values.
> > (we have solved this before, possibly/optionally including preprocessing
> > to categorize things.)
> > top k is more fun! it seems niftier.
> > so we've got some input logits. these are probably getting multiplied by a 
> > _huge_ matrix.
> > we could take a naive-ish approach and discard the parts that are
> > multiplied by values near zero. (each dot product has large values and
> > small values; we could skip all values smaller than some fraction of the
> > largest.)
> > - this works much better if we find a way to clump the mask by locality
> > :/ since http servers like to send byte ranges, not sparse masks
> > - this is really cool if we make something like a bayesian or
> > error-labeled datatype, so instead of 3.4 it's more like 3.4+-0.31; this
> > would give much more useful information at the end
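the error-labeled idea could be prototyped as a value/error pair with first-order propagation; `ErrTensor` and its propagation rule below are my own sketch, not anything from httptransformer:

```python
import torch

class ErrTensor:
    """Hypothetical value+-error pair. Errors propagate to first order:
    for z = x @ W with W exact, the error bound is |x_err| @ |W|."""
    def __init__(self, val, err):
        self.val, self.err = val, err

    def matmul(self, W):
        # W assumed exact; dropped/quantized input entries show up in self.err
        return ErrTensor(self.val @ W, self.err @ W.abs())

x = ErrTensor(torch.tensor([[3.4, 0.1]]), torch.tensor([[0.31, 0.1]]))
W = torch.tensor([[1.0, -2.0], [0.5, 0.5]])
y = x.matmul(W)
# y.val is the usual product; y.err bounds how far off each entry could be
```

masking a value would then mean moving its magnitude into `err` instead of silently dropping it.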
> > but yeah it seems interesting to just try the mask! involves some simple 
> > torch kernel algebra
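a minimal sketch of that mask, assuming torch and made-up sizes (`masked_matmul` and `frac` are invented names; the actual win in httptransformer would be skipping the fetch for masked weight columns):

```python
import torch

def masked_matmul(x, W, frac=0.01):
    # hypothetical sketch: zero input entries smaller than `frac` of the
    # per-row largest magnitude, then multiply as usual; the weight columns
    # paired with zeroed entries would not need to be downloaded
    thresh = x.abs().max(dim=-1, keepdim=True).values * frac
    mask = x.abs() >= thresh
    return torch.matmul(x * mask, W.T), mask

torch.manual_seed(0)
x = torch.randn(1, 6, 16384, dtype=torch.float64)
W = torch.randn(32, 16384, dtype=torch.float64)  # stand-in for the huge matrix
approx, mask = masked_matmul(x, W)
exact = torch.matmul(x, W.T)
# mask.sum() tells how many of the 16384 columns actually needed fetching
```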
> > there's a small space here where one can get the _same exact output_ by
> > predicting that some products would be smaller than the precision of the
> > sum ... that might at least need information on the magnitude of the
> > weights, unsure ... but there are likely heuristics one could apply here
> > that would be accurate, given the rote nature of the training process,
> > and given how little useful+accurate information one would expect from
> > an overtiny number multiplied by an overlarge one ...
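one concrete version of that: a term can be skipped once it is smaller than one ulp of the accumulated sum, since adding it can't change the rounded result. this sketch checks the condition directly rather than predicting it, and `prunable_terms` is a made-up name (summation order complicates the real case):

```python
import numpy as np

def prunable_terms(x, w, dtype=np.float32):
    # hypothetical heuristic check, not httptransformer code: a term
    # x[i]*w[i] cannot change the accumulated total once its magnitude is
    # below the rounding step (one ulp) of the total in the target dtype
    products = (x * w).astype(dtype)
    total = products.sum(dtype=dtype)
    ulp = np.spacing(np.abs(total))
    return np.abs(products) < ulp
```

e.g. `prunable_terms(np.array([1e-10, 1.0]), np.array([1.0, 1.0]))` flags the 1e-10 term: it is far below the ~1.2e-7 ulp of a float32 total of 1.0.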
> 
> that's kind of more in line with the intent of httptransformer and 
> llm_logits, to be able to work on things like that on your cellphone, but i 
> didn't make llm_logits for this model
> 
> ummm i guess i'll look a little at matmul

 97             number_passes = math.ceil(weight.mem_usage_frac())
 98             if number_passes == 1:
 99                 product = torch.matmul(input, weight.fetch(progress=name, validate_usage=False).T)
100             else:
101                 rows_at_once = math.ceil(weight.shape[0] / number_passes)
102  ->             product = torch.cat([
103                     torch.matmul(
104                         input,
105                         weight[offset : offset+rows_at_once].fetch(progress=f'row{offset}-{offset+rows_at_once}/{weight.shape[0]}', validate_usage=False).T
106                     )
107                     for offset in tqdm.tqdm(range(
(Pdb) p input.shape
torch.Size([1, 6, 16384])
(Pdb) p input.abs().max(dim=-1)
torch.return_types.max(
values=tensor([[1.5359, 0.2287, 0.1609, 0.1848, 0.1869, 0.2321]], 
dtype=torch.float64),
indices=tensor([[ 6303,  6303, 14427, 14427, 14427,   205]]))
(Pdb) p input.abs().min(dim=-1)
torch.return_types.min(
values=tensor([[1.0807e-08, 1.3109e-07, 3.4837e-07, 7.3625e-08, 1.0437e-06, 
7.8622e-08]],
       dtype=torch.float64),
indices=tensor([[   78,  6285, 10787,  5347, 10964, 15229]]))
(Pdb) p input.dtype
torch.float64

so yes, it's float64, but 1e-6 is still a lot smaller than 0.1. aren't you 
curious what would happen if we only multiplied the largest 16 values? or 
the largest 1024?
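that experiment is cheap to phrase in torch; everything here (`topk_input_matmul`, the 64-row stand-in weight) is made up for illustration, but the shapes and dtype match the pdb session above:

```python
import torch

def topk_input_matmul(x, W, k):
    # keep only the k largest-magnitude entries of each input row and
    # zero the rest before multiplying
    vals, idx = x.abs().topk(k, dim=-1)
    sparse = torch.zeros_like(x)
    sparse.scatter_(-1, idx, x.gather(-1, idx))
    return sparse @ W.T

torch.manual_seed(0)
x = torch.randn(1, 6, 16384, dtype=torch.float64)
W = torch.randn(64, 16384, dtype=torch.float64)  # stand-in for the huge matrix
exact = x @ W.T
errs = {k: (topk_input_matmul(x, W, k) - exact).abs().max().item()
        for k in (16, 1024, 16384)}
# errs shows how the worst-case deviation shrinks as k grows;
# k = 16384 keeps everything and reproduces the exact product
```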
