https://discord.com/channels/823813159592001537/1051442019035254784/1057764315656093806 re modern, I was thinking things like the linear transformer, RWKV, the holographic HRRformer, S4D or SGConv, the new "Mega" model… something tuned for long context with less RAM. The HRRformer paper claims it is competitive with only one model layer. I'd also use adapters to speed up training and make it more accessible.
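For context, the adapter idea mentioned above is usually a small bottleneck module inserted into a frozen base model so that only a few parameters need training. A minimal NumPy sketch of that shape (down-project, nonlinearity, up-project, residual); the sizes and zero-init here are illustrative assumptions, not any specific library's API:

```python
import numpy as np

def adapter(x, W_down, W_up):
    # Bottleneck adapter: down-project, ReLU, up-project, residual.
    # Only W_down / W_up would be trained; the base model stays frozen.
    h = np.maximum(0.0, x @ W_down)   # bottleneck activation
    return x + h @ W_up               # residual keeps the output shape

rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 16       # assumed sizes for illustration
x = rng.standard_normal((4, d_model))
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

y = adapter(x, W_down, W_up)
```

With the up-projection zero-initialized, the adapter is an identity function at the start of training, so inserting it doesn't disturb the pretrained model's behavior.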
