LONGNET: Scaling Transformers to 1,000,000,000 Tokens
<https://arxiv.org/pdf/2307.02486.pdf>

> In this work, we successfully scale the sequence length to 1 billion
> tokens. Our solution is LONGNET, which replaces the attention of vanilla
> Transformers with a novel component named dilated attention. The general
> design principle is - attention allocation decreases exponentially as the
> distance between tokens grows. We prove that it obtains a linear
> computation complexity and a logarithm dependency between tokens. This
> deals with the contradiction between limited attention resources and the
> accessibility to every token. In the implementation, LONGNET can be
> transformed into a dense Transformer, which seamlessly supports the
> off-the-shelf optimization for transformers (e.g., kernel fusion,
> quantization, and distributed training). Taking advantage of the linear
> complexity, LONGNET can parallelize the training across nodes, breaking the
> constraint of both computation and memory with a distributed algorithm.
> This allows us to efficiently scale up the sequence length to *1B tokens
> with nearly constant runtime* (see Figure 5), while vanilla Transformer
> suffers from quadratic complexity.

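For concreteness, here is a minimal sketch of what I take "dilated
attention" to mean from the abstract: split the sequence into segments
and, inside each segment, attend only to every r-th token, so the
per-segment cost shrinks by r^2 and the total stays linear in sequence
length. The function name and the segment_len/dilation parameters below
are my own placeholders, not the paper's released code, and the real
LongNet mixes several geometrically growing (segment length, dilation)
pairs and covers all positions, rather than the single non-causal pair
shown here.

import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_len=4, dilation=2):
    # q, k, v: (seq_len, d); seq_len assumed to be a multiple of segment_len.
    seq_len, d = q.shape
    out = torch.zeros_like(v)
    for start in range(0, seq_len, segment_len):
        # Keep only every `dilation`-th position inside this segment.
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[idx], k[idx], v[idx]
        # Ordinary scaled dot-product attention on the sparsified segment;
        # cost per segment is (segment_len/dilation)^2, so the total over
        # seq_len/segment_len segments is linear in seq_len.
        attn = F.softmax(qs @ ks.T / d ** 0.5, dim=-1)
        out[idx] = attn @ vs
    # Positions skipped by the dilation stay zero in this toy version; the
    # paper instead combines multiple dilation patterns so every token is
    # attended somewhere.
    return out

q = torch.randn(16, 8); k = torch.randn(16, 8); v = torch.randn(16, 8)
print(dilated_attention(q, k, v).shape)  # torch.Size([16, 8])
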

1B tokens is more than 10x the content of the Wikipedia corpus in the
Large Text Compression Benchmark. Given that, *within the attention
context window*, transformer attention mechanisms operate at a higher
level in the Chomsky Hierarchy of expressivity, this could yield a
vastly better model of human knowledge than anything now contemplated.

But I'm skeptical.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T4353b67b92739646-Md2a06f9a991a9d54aeec233c
