LONGNET: Scaling Transformers to 1,000,000,000 Tokens <https://arxiv.org/pdf/2307.02486.pdf>
> In this work, we successfully scale the sequence length to 1 billion
> tokens. Our solution is LONGNET, which replaces the attention of vanilla
> Transformers with a novel component named dilated attention. The general
> design principle is - attention allocation decreases exponentially as the
> distance between tokens grows. We prove that it obtains a linear
> computation complexity and a logarithm dependency between tokens. This
> deals with the contradiction between limited attention resources and the
> accessibility to every token. In the implementation, LONGNET can be
> transformed into a dense Transformer, which seamlessly supports the
> off-the-shelf optimization for transformers (e.g., kernel fusion,
> quantization, and distributed training). Taking advantage of the linear
> complexity, LONGNET can parallelize the training across nodes, breaking the
> constraint of both computation and memory with a distributed algorithm.
> This allows us to efficiently scale up the sequence length to *1B tokens
> with nearly constant runtime* (see Figure 5), while vanilla Transformer
> suffers from quadratic complexity.

1B tokens is more than 10x the content of the Wikipedia corpus used in the Large Text Compression Benchmark. Given that, *within the attention context window*, transformer attention mechanisms operate at a higher level of the Chomsky Hierarchy of expressivity, this could yield a model of human knowledge vastly better than anything now contemplated. But I'm skeptical.
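
For intuition, here is a rough sketch in Python of the kind of sparsity pattern dilated attention builds: segment-local attention applied at several scales, with the segment length and the dilation (row-skipping) rate growing geometrically, so attention allocation thins out with distance. This is my own reading of the paper's description, not code from it; the parameter names and defaults (w0, r0, factor) are illustrative assumptions. It only enumerates which positions each token attends to and counts the attended pairs.

# Rough sketch (not the paper's code) of a dilated-attention-style pattern:
# segment-local attention at several scales, with segment length w and
# dilation r both growing geometrically. Names and defaults are assumptions.

def dilated_attention_pattern(seq_len, w0=4, r0=1, factor=2):
    """For each query position, collect the key positions it attends to."""
    attended = [set() for _ in range(seq_len)]
    w, r = w0, r0
    while True:
        for seg_start in range(0, seq_len, w):
            seg_end = min(seg_start + w, seq_len)
            kept = list(range(seg_start, seg_end, r))  # every r-th row of the segment
            for q in kept:
                attended[q].update(kept)  # dense attention among the kept rows
        if w >= seq_len:          # the coarsest scale spans the whole sequence,
            break                 # so the number of scales is O(log seq_len)
        w, r = w * factor, r * factor
    return attended

if __name__ == "__main__":
    for n in (64, 128, 256, 512, 1024):
        pairs = sum(len(s) for s in dilated_attention_pattern(n))
        print(f"seq_len={n:5d}  attended pairs={pairs:7d}  pairs per token~{pairs / n:.1f}")

The "pairs per token" column stays roughly flat as seq_len doubles, which is the linear-complexity claim in miniature; full attention would double it at every step, and the number of scales needed to connect any two tokens grows only logarithmically.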
