Spamming the Data Space – CLIP, GPT and synthetic data

Francis Hunger Mon, 19 Dec 2022 01:54:07 -0800

Dear Nettimers,

Honoring the institutionalized format, I'm posting this speculative text in the hope of comments.


best

Francis

@[email protected] / www.irmielin.org



*** Spamming the Data Space – CLIP, GPT and synthetic data ***
** Introduction **

For the last time in human history the cultural-data space has not been contaminated. In recent years a new technique to acquire knowledge has emerged. Scraping the Internet and extracting information and data has become a new modus operandi for companies and for university researchers in the field of machine learning. One of the largest publicly available training data sets combining images and labels (which are meant to describe the images' content) is currently LAION-5B, with 5.85 billion image-text pairs (Ilharco et al. 2021).[1] The scope of scraping internet resources has become so all-encompassing that researcher Eva Cetinic has proposed calling this form a ‘cultural snapshot’: “By encoding numerous associations which exist between data items collected at a certain point in time, those models therefore represent synchronic assemblages of cultural snapshots, embedded in a specific technological framework. Metaphorically those models can be considered as some sort of encapsulation of the collective (un)conscious […]” (Cetinic 2022).[2] The important suggestion Cetinic makes is that these data collections are temporally anchored. The temporal dimension of these snapshots suggests that digital cultural snapshots taken at different times document different states of (online) culture. So how will a 2021 snapshot differ from a 2031 cultural snapshot?


** Consequences **

Multi-modal models like CLIP, trained on large-scale data sets such as LAION-5B, provide the statistical means to generate images from text prompts. In the CLIP model, two embedding spaces, one for images and one for text descriptions, are layered together by mathematical methods, so that the vectors in one space, the image domain, align with vectors in the other space, the text domain. The assumption is that there is a similarity between both and that one can be translated into the other. In three short examples I’ll discuss some of the consequences of the underlying data for large-scale models from the perspective of cultural snapshots.
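The alignment of an image space and a text space described above can be illustrated with a minimal sketch. It does not use CLIP itself: the three-dimensional vectors are hand-written stand-ins for encoder outputs, and the names image_embeddings, text_embeddings and best_caption are hypothetical. It only shows the idea that, once both domains share a space, the caption closest to an image can be found by comparing vectors (here with cosine similarity).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: toy vectors standing in for the outputs of an image encoder
# and a text encoder that were trained to share one embedding space.
image_embeddings = {
    "photo_of_cat.jpg": [0.9, 0.1, 0.0],
    "photo_of_skyline.jpg": [0.1, 0.9, 0.2],
}
text_embeddings = {
    "a cat": [0.8, 0.2, 0.1],
    "a city skyline": [0.0, 1.0, 0.3],
}


def best_caption(image_name):
    """Return the text whose vector lies closest to the image vector."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda text: cosine_similarity(image_vec, text_embeddings[text]),
    )


for name in image_embeddings:
    print(name, "->", best_caption(name))
```

In a real model the vectors have hundreds of dimensions and are learned from scraped image-text pairs, which is why the composition of the underlying cultural snapshot shapes what counts as "close".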

1.) Data Bias: A critical discussion of these large-scale multi-modal models has pointed out how they are culturally skewed and reproduce sexist and racist biases. Researchers Fabian Offert and Thao Phan, for instance, describe how the company OpenAI decided not to mitigate the problem of whiteness by changing the model’s underlying data. Instead, OpenAI added certain invisible keywords to users’ prompts to have more people of color included, without changing the model. Evidently, the computation needed to create these models, or even to curate the underlying data, is so costly that for economic reasons even major problems are not corrected in the embedding space itself. Discussing the prevalent ‘whiteness’ in these models further, Offert and Phan suggest turning to the humanities in order to “identify the different technical modes of whiteness at play, and understand the reconceptualization and resurrection of whiteness as a machinic concept” (Offert and Phan 2022, 3).[3]

2.) Uneven spatial distribution: Users of large-scale multi-modal models have tested their limits when generating images. ‘Crungus’ and ‘Loab’ are two examples. ‘Loab’, the image of a woman, appeared when AI artist Supercomposite looked for the negative of a prompt: “DIGITA PNTICS skyline logo::-1”. Loab appears to be a consistent pixel accumulation, which repeatedly emerges in different configurations and cannot easily be traced back to a single origin.[4] During intensive testing, the creator/discoverer of ‘Loab’ felt that Loab might exist in its own pocket, because it was relatively reproducible compared to other prompts, as if it were populating a certain statistical region within the larger latent space. Another, similar phenomenon of uneven spatial distribution in latent space is ‘Crungus’, basically a fantasy word which as a prompt nevertheless created results: a snarling, zombie-like figure with shoulder-length hair, which could be part of a horror movie.[5]

Both examples demonstrate that the cultural snapshots also contain material which cannot be easily identified or traced back, and they demonstrate how the latent space is unevenly distributed by design. Since the models are built by a process called zero-shot learning, in contrast to, for instance, the supervised learning used with ImageNet, intentional ontologies are no longer used in the knowledge creation of these models. Human involvement consists of the uncoordinated captioning of images by users online, and of researchers setting up the scraping algorithms and excluding certain domains from being scraped.

3.) Data Spam: The history of spam shows that it has emerged whenever a business case could be made for creating large amounts of messages by copy-and-paste. Email spam, forum spam, comment spam and video spam on YouTube have been common and consistent over the past decades. Hand in hand with spam goes Search Engine Optimization (SEO), which optimizes content for discoverability by knowledge aggregators, namely search engines. Text generators like GPT-3 have already proven to be an annoyance when users of Stack Overflow, one of the central online forums for programmers, began to flood it with automated comments. It turned out that many generated answers were incorrect but not easily discernible as such: “The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce” (Stack Overflow moderators in: Vincent 2022). This is only one example of many, and it will extend from text to image and video generation and will become a major problem on Instagram, Flickr, Pinterest, and many other visual platforms. Possible applications for data spam are fake news, subversive messages, advertisement and so on.

Further, synthetic text spam or synthetic image spam produced with statistical tools like GPT or CLIP will be evaluated by the same or similar machine learning architectures, and may therefore conform more closely to the mathematical models than organically human-produced content.

All in all, this poses the question of how to assess any online content after 2021.


** Data Ecologies **

While some may argue that generated text and images will save time and money for businesses, a data ecological view immediately recognizes a major problem: AI feeds into AI. To rephrase it: statistical computing feeds into statistical computing. In using these models and publishing the results online we are beginning to create a loop of prompts and results, with the results being fed into the next iteration of the cultural snapshots. That’s why I call the early cultural snapshots still uncontaminated, and I expect the next iterations of cultural snapshots will be contaminated. In the long term this may lead to a deterioration of the quality of the appropriated data. It also opens the opportunity for data spamming. Spammers or search engine optimizers may decide to create huge amounts of pictures and captions to create a stronger presence for a certain product or cause.
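The loop of results feeding into the next snapshot can be made concrete with a deliberately simple piece of arithmetic. This is a toy sketch under loudly stated assumptions, not a measurement: the function name synthetic_share_over_time and all numbers are hypothetical, and it assumes that a fixed amount of human-made and of generated items is published between two snapshots and that nothing is ever removed.

```python
def synthetic_share_over_time(initial_human, human_per_round,
                              synthetic_per_round, rounds):
    """Share of generated items in each successive snapshot.

    Assumption: the first snapshot contains only human-made items, and
    every round adds a constant number of human-made and generated items.
    """
    human = initial_human
    synthetic = 0
    shares = [0.0]  # the first, still uncontaminated snapshot
    for _ in range(rounds):
        human += human_per_round
        synthetic += synthetic_per_round
        shares.append(synthetic / (human + synthetic))
    return shares


# Hypothetical numbers: 100 units of human-made content to start with,
# then 10 human-made and 30 generated units added per round.
for round_number, share in enumerate(synthetic_share_over_time(100, 10, 30, 3)):
    print(f"snapshot {round_number}: {share:.1%} generated")
```

Under these assumptions the generated share rises with every snapshot. How fast it would rise in practice depends on quantities that are not known here.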

These are the conditions under which such large image collections become available at all: the extraction of the unpaid labor of those who originally published the images online. Both the extractive nature and the very likely future contamination of cultural snapshots will make this approach untenable and unsustainable in the long run.



* Sources *

Baio, Andy. 2022. “AI Data Laundering – How Academic and Nonprofit Researchers Shield Tech Companies from Accountability.” Blog. Waxy.org (blog). September 30, 2022. https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/.

Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.” arXiv. https://doi.org/10.48550/arXiv.2110.01963.

Cetinic, Eva. 2022. “The Myth of Culturally Agnostic AI Models.” arXiv, November, 4. https://doi.org/10.48550/arXiv.2211.15271.

Ilharco, Gabriel, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, et al. 2021. “OpenCLIP.” Hamburg: Laion e.V. Zenodo. https://doi.org/10.5281/ZENODO.5143773.

Kelly [@Brainmage], Guy. 2022. “Well I REALLY Don’t like How Similar All These Pictures of ‘Crungus’, ….” Tweet. Twitter. https://twitter.com/Brainmage/status/1538111384390619136.

Lavoipierre, Ange. 2022. “There’s a Woman Haunting the Internet. She Was Created by AI. Now She Won’t Leave.” ABC News, November 25, 2022. https://www.abc.net.au/news/2022-11-26/loab-age-of-artificial-intelligence-future/101678206.

Offert, Fabian, and Thao Phan. 2022. “A Sign That Spells: DALL-E 2, Invisual Images and The Racial Politics of Feature Space.” ArXiv:2211.06323 [Cs], October. http://arxiv.org/abs/2211.06323.

Supercomposite [@supercomposite]. 2022. “🧵: I Discovered This Woman, Who I Call Loab, in April. ….” Tweet. Twitter. https://twitter.com/supercomposite/status/1567162288087470081.

Vincent, James. 2022. “AI-Generated Answers Temporarily Banned on Coding Q&A Site Stack Overflow.” The Verge. December 5, 2022. https://www.theverge.com/2022/12/5/23493932/chatgpt-ai-generated-answers-temporarily-banned-stack-overflow-llms-dangers.

Weisbuch, Max, Sarah A. Lamer, Evelyne Treinen, and Kristin Pauker. 2017. “Cultural Snapshots – Theory and Method.” Social and Personality Psychology Compass 11 (9). https://doi.org/10.1111/spc3.12334.


* Footnotes *

[1] LAION is organized as an independent German research association. This division of labor between smaller and larger actors, which shifts responsibility away from the large companies that use the models based on these data collections, has been criticized in “AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability” (Baio 2022).

[2] Cetinic borrows this concept from social and cultural psychology studies, referring to “Cultural Snapshots – Theory and Method” (Weisbuch et al. 2017).

[3] Cf. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes” (Birhane, Prabhu, and Kahembwe 2021).

[4] Cf. (Supercomposite [@supercomposite] 2022; Lavoipierre 2022). Note: I wasn’t able to reproduce Loab with my installation of Stable Diffusion v1.

[5] It was first produced using DALL-E mini in June 2022 by actor and comedian Guy Kelly, and first reported on Twitter (Kelly [@Brainmage] 2022).

First published Dec. 7 2022 at https://databasecultures.irmielin.org/spamming-the-data-space-clip-gpt-and-synthetic-data/

#  distributed via <nettime>: no commercial use without permission
#  <nettime>  is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mx.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: [email protected]
#  @nettime_bot tweets mail w/ sender unless #ANON is in Subject:
