Dear Nettimers,
honoring the institutionalized format, I'm posting this speculative text
in the hope of comments.
best
Francis
@databasecultures@dair-community.social / www.irmielin.org
*** Spamming the Data Space – CLIP, GPT and synthetic data ***
** Introduction **
For the last time in human history the cultural-data space has not been
contaminated. In recent years a new technique to acquire knowledge has
emerged. Scraping the Internet and extracting information and data has
become a new modus for companies and for university researchers in the
field of machine learning. One of the largest training data sets
currently available to the public, combining images with labels (which
are meant to describe the images' content), is LAION-5B, with 5.85
billion image-text pairs (Ilharco, Gabriel et al. 2021).[1]
The scope of scraping internet resources has become so all-encompassing
that researcher Eva Cetinic has proposed calling the result a ‘cultural
snapshot’: “By encoding numerous associations which exist between data
items collected at a certain point in time, those models therefore
represent synchronic assemblages of cultural snapshots, embedded in a
specific technological framework. Metaphorically those models can be
considered as some sort of encapsulation of the collective (un)conscious
[…]” (Cetinic 2022).[2] Cetinic's important suggestion is that these
data collections are temporally anchored. The temporal
dimension of these snapshots suggests that digital cultural snapshots
taken at different times document different states of (online-)culture.
So how will a 2021 snapshot differ from a 2031 cultural snapshot?
** Consequences **
Multi-modal models like CLIP, trained on large-scale data sets such as
LAION-5B, provide the statistical means to generate images from text
prompts. CLIP combines two embedding spaces, one for images and one for
text descriptions, which are aligned mathematically so that vectors in
the image domain lie close to the vectors of matching descriptions in
the text domain. The assumption is that both domains are similar enough
for one to be translated into the other. In three short examples I’ll
discuss some of the consequences of the underlying data for large-scale
models from the perspective of cultural snapshots.
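The alignment of image and text embedding spaces described above can be
illustrated with a small sketch. It is only a toy under stated
assumptions: the three-dimensional vectors and the captions are invented
by hand, and the helper names `cosine` and `best_caption` are
hypothetical. A real model would produce vectors with hundreds of
dimensions from trained encoders; the sketch only shows how a caption is
matched to an image by vector similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made text embeddings (not output of any real model).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a city skyline": [0.0, 0.1, 0.9],
}

# Hypothetical embedding standing in for an encoded picture of a cat.
image_embedding = [0.8, 0.2, 0.1]


def best_caption(image_vec, captions):
    """Return the caption whose vector is most similar to the image vector."""
    return max(captions, key=lambda c: cosine(image_vec, captions[c]))


print(best_caption(image_embedding, text_embeddings))
```

Whatever the scraped captions associate with an image ends up close to
it in this shared space, which is why the composition of the underlying
data matters for everything the model later produces.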
1.) Data Bias: Critical discussions of these large-scale multi-modal
models have pointed out how they are culturally skewed and reproduce
sexist and racist biases. Researchers Fabian Offert and Thao Phan, for
instance, describe how the company OpenAI decided not to mitigate the
problem of whiteness by changing the model’s underlying data. Instead,
OpenAI added certain invisible keywords to users’ prompts to have more
people of color included, without changing the model. Apparently, the
computation required to create these models, or even to curate the
underlying data, is so costly that for economic reasons even major
problems cannot be corrected in the embedding space itself.
Discussing the prevalent ‘whiteness’ in these models further, Offert and
Phan suggest turning to the humanities in order to “identify the different
technical modes of whiteness at play, and understand the
reconceptualization and resurrection of whiteness as a machinic concept”
(Offert and Phan 2022, 3).[3]
2.) Uneven spatial distribution: Users of large-scale multi-modal models
have tested their limits when generating images. ‘Crungus’ and ‘Loab’
are two examples. ‘Loab’, the image of a woman, appeared when AI artist
Supercomposite looked for the negative of a prompt: “DIGITA PNTICS
skyline logo::-1”. Loab appears to be a consistent pixel accumulation,
which repeatedly emerges in different configurations and cannot easily
be traced back to a single origin.[4] During intensive testing, the
creator/discoverer of ‘Loab’ felt that Loab might exist in its own
pocket of the latent space: compared to other prompts it was relatively
reproducible, as if it populated a certain statistical region within
the larger space. A similar phenomenon of uneven spatial distribution in
latent space is ‘Crungus’, essentially a fantasy word which as a prompt
nevertheless produced results: a snarling, zombie-like figure with
shoulder-length hair, which could be part of a horror movie.[5]
Both examples demonstrate that the cultural snapshots also contain
material which cannot be easily identified or traced back, and they
demonstrate how unevenly the latent space is populated by design. Since
the models are built by a process called zero-shot learning, in contrast
to, for instance, the supervised learning used with ImageNet,
intentional ontologies are no longer used in the knowledge creation of
these models. Human involvement consists of the uncoordinated captioning
of images by users online, and of researchers setting up the scraping
algorithms and excluding certain domains from being scraped.
3.) Data Spam: Historically, spam has emerged whenever a business case
could be made for creating large numbers of messages using
copy-and-paste. Email spam, forum spam, comment spam, and video spam on
YouTube have been common and consistent over the past decades. Hand in
hand with spam goes Search Engine Optimization (SEO), which optimizes
content for discoverability by knowledge aggregators, namely search
engines. Text generators like GPT-3 have already proven to be an
annoyance: users of Stack Overflow, one of the central online forums for
programmers, began to flood it with automated answers. It turned out
that many generated answers were incorrect but not easily discernible as
such:
“The primary problem is that while the answers which ChatGPT produces
have a high rate of being incorrect, they typically look like they might
be good and the answers are very easy to produce” (Stack Overflow
moderators in: Vincent 2022). This is only one example of many; the
problem will extend from text to image and video generation and is
likely to become a major problem on Instagram, Flickr, Pinterest, and
many other visual platforms. Possible applications for data spam include
fake news, subversive messages, and advertisement.
Further, synthetic text spam or synthetic image spam produced with
statistical tools like GPT or CLIP will be evaluated by the same or
similar machine learning architectures, and may therefore conform more
closely to the mathematical models than organically human-produced
content.
All in all, this raises the question of how to assess any online content
after 2021.
** Data Ecologies **
While some may argue that generated text and images will save time and
money for businesses, a data ecological view immediately recognizes a
major problem: AI feeds into AI. To rephrase it: statistical computing
feeds into statistical computing. In using these models and publishing
the results online we are beginning to create a loop of prompts and
results, with the results being fed into the next iteration of the
cultural snapshots. That’s why I call the early cultural snapshots still
uncontaminated, and I expect the next iterations of cultural snapshots
will be contaminated. In the long term this may lead to a deterioration
of the quality of the appropriated data. It also creates an opportunity
for data spamming. Spammers or search engine optimizers may decide to
create huge numbers of pictures and captions to create a stronger
presence for a certain product or cause.
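The loop of results feeding into the next cultural snapshot can be made
concrete with a back-of-the-envelope sketch. All numbers here are
invented assumptions, not measurements: the function name
`synthetic_share`, the growth rate of the corpus per iteration, and the
fraction of newly published material that is generated are hypothetical
parameters chosen only to show the shape of the dynamic.

```python
def synthetic_share(initial_share, growth, synthetic_rate, steps):
    """Return the synthetic share of the corpus after each snapshot.

    initial_share:  fraction of generated content in the first snapshot
    growth:         new content per step, relative to the current corpus
    synthetic_rate: assumed fraction of new content that is generated
    """
    shares = [initial_share]
    for _ in range(steps):
        current = shares[-1]
        # Old content stays in the corpus; new content is added on top,
        # and a fixed fraction of it is assumed to be machine-generated.
        updated = (current + growth * synthetic_rate) / (1.0 + growth)
        shares.append(updated)
    return shares


# Hypothetical scenario: an uncontaminated first snapshot, a corpus that
# grows by half each iteration, 30% of new uploads being generated.
shares = synthetic_share(0.0, 0.5, 0.3, 10)
for step, share in enumerate(shares):
    print(f"snapshot {step}: {share:.3f} synthetic")
```

Under these assumptions the contaminated share rises with every
snapshot and approaches the rate at which generated material is being
published. The sketch says nothing about how that material degrades a
model trained on it; it only illustrates that an early snapshot cannot
be retaken later.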
These are the conditions under which such large image collections become
available at all: the extraction of the unpaid labor of those who
originally published the images online. Both the extractive nature and
the very likely future contamination of cultural snapshots will make
this approach untenable and unsustainable in the long run.
* Sources *
Baio, Andy. 2022. “AI Data Laundering – How Academic and Nonprofit
Researchers Shield Tech Companies from Accountability.” Blog. Waxy.Org
(blog). September 30, 2022.
https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/.
Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021.
“Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.”
arXiv. https://doi.org/10.48550/arXiv.2110.01963.
Cetinic, Eva. 2022. “The Myth of Culturally Agnostic AI Models.” Arxiv,
November, 4. https://doi.org/10.48550/arXiv.2211.15271.
Ilharco, Gabriel, Wortsman, Mitchell, Carlini, Nicholas, Taori, Rohan,
Dave, Achal, Shankar, Vaishaal, Namkoong, Hongseok, et al. 2021.
“OpenCLIP.” Hamburg: Laion e.V. Zenodo.
https://doi.org/10.5281/ZENODO.5143773.
Kelly [@Brainmage], Guy. 2022. “Well I REALLY Don’t like How Similar All
These Pictures of ‘Crungus’, ….” Tweet. Twitter.
https://twitter.com/Brainmage/status/1538111384390619136.
Lavoipierre, Ange. 2022. “There’s a Woman Haunting the Internet. She Was
Created by AI. Now She Won’t Leave.” ABC News, November 25, 2022.
https://www.abc.net.au/news/2022-11-26/loab-age-of-artificial-intelligence-future/101678206.
Offert, Fabian, and Thao Phan. 2022. “A Sign That Spells: DALL-E 2,
Invisual Images and The Racial Politics of Feature Space.”
ArXiv:2211.06323 [Cs], October. http://arxiv.org/abs/2211.06323.
Supercomposite [@supercomposite]. 2022. “🧵: I Discovered This Woman,
Who I Call Loab, in April. ….” Tweet. Twitter.
https://twitter.com/supercomposite/status/1567162288087470081.
Vincent, James. 2022. “AI-Generated Answers Temporarily Banned on Coding
Q&A Site Stack Overflow.” The Verge. December 5, 2022.
https://www.theverge.com/2022/12/5/23493932/chatgpt-ai-generated-answers-temporarily-banned-stack-overflow-llms-dangers.
Weisbuch, Max, Sarah A. Lamer, Evelyne Treinen, and Kristin Pauker.
2017. “Cultural Snapshots – Theory and Method.” Social and Personality
Psychology Compass 11 (9). https://doi.org/10.1111/spc3.12334.
* Footnotes *
[1] LAION is organized as an independent German research association.
This division of labor between smaller and larger actors, which shifts
responsibility away from the large companies that use the models based
on these data collections, has been criticized in AI Data Laundering:
How Academic and Nonprofit Researchers Shield Tech Companies from
Accountability (Baio 2022).
[2] Cetinic borrows this concept from social and cultural psychology
studies, referring to Cultural snapshots – Theory and method (Weisbuch
et al. 2017).
[3] Cf. Multimodal datasets: misogyny, pornography, and malignant
stereotypes (Birhane, Prabhu, and Kahembwe 2021).
[4] Cf. (Supercomposite [@supercomposite] 2022; Lavoipierre 2022).
Note: I wasn’t able to reproduce Loab with my installation of Stable
Diffusion v1.
[5] It was first produced using DALL-E mini in June 2022 by actor and
comedian Guy Kelly, and first reported on Twitter (Kelly [@Brainmage] 2022).
First published Dec. 7 2022 at
https://databasecultures.irmielin.org/spamming-the-data-space-clip-gpt-and-synthetic-data/
# distributed via <nettime>: no commercial use without permission
# <nettime> is a moderated mailing list for net criticism,
# collaborative text filtering and cultural politics of the nets
# more info: http://mx.kein.org/mailman/listinfo/nettime-l
# archive: http://www.nettime.org contact: nett...@kein.org
# @nettime_bot tweets mail w/ sender unless #ANON is in Subject: