I emailed the author of the paper above. He said no: it is not suitable for NLP tasks and will only work on image data (hence GPT-2 is the agnostic/general one):
"Thanks for your interest. The key novelty in I-GPT is the transformer, which utilizes multiple head attention models to get global information for scene/NLP understanding. Hence, they can generate correct content for arbitrary inputs. However, their model needs expensive computing costs and high memory space as the transformer store the global relationship of each key. Our model utilizes the CNN structure, the attention model is only used in one layer for copying information from visible regions. I don’t think this CNN-Based structure is suitable for the NLP task."

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T48eb73fe225c230b-M029f284fb93f6c8e24efb17d
Delivery options: https://agi.topicbox.com/groups/agi/subscription
