[LINK] AI .. Google Says It Will Scrape Publishers’ Data for AI Unless Forced Not To

Stephen Loosley Sat, 12 Aug 2023 21:59:27 -0700

Google Says It Will Scrape Publishers’ Data for AI Unless Forced Not To

Google’s AI mining-by-default proposal to the Australian government comes a 
month after the company declared it would scrape all the internet's data.


By Kyle Barr, Published Wednesday 11:15AM  
https://gizmodo.com/google-bard-ai-scrape-websites-data-australia-opt-out-1850720633


Google hungers for all that content produced by the wealth of digital 
publishers creating text, video, and images on a daily basis.

To deal with the sticky copyright issues at the heart of AI training, Google is 
proposing that all those companies who don’t want their content gobbled up will 
need to “opt-out” to ensure Google’s open maw doesn’t swallow all their juicy 
data.

The tech giant offered this raw deal to the Australian government in response 
to the country’s recent proposal to ban “high-risk” AI applications, including 
creating deepfakes, disinformation, and discrimination.

As first reported by The Guardian, Google said publishers should be able to 
opt out of having their content copied for the purpose of training AI.

Google released its Bard chatbot in the land down under back in May, and since 
then, the company has been trying to entice the country into allowing it to 
scrape ever more data.

Google has already written to the Australian government over relaxing copyright 
laws to allow more AI training.

Now it’s being open about establishing an AI-friendly internet that allows 
scraping by default.

The proposal would force publishers both big and small to educate themselves 
about the opt-out and establish it on their own sites rather than putting the 
onus on Google.

The company did not explicitly say how this opt-out function would work, and 
Google did not immediately respond to Gizmodo’s request for comment. In a July 
blog post, Google called for new “standards and protocols” about how web 
publishers participate in the internet.

The company pointed to the 30-year-old, community-developed robots.txt 
standard, a protocol that indicates to web crawlers and bots which portions of 
a site they’re allowed to visit.

Of course, that robots.txt protocol only works with nice bots that agree to 
comply voluntarily. It doesn’t impede any company that decides not to obey the 
standard. Plus, it doesn’t take back any data that was already scraped without 
publishers’ consent.
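
As a rough illustration of the voluntary check described above, here is a 
minimal Python sketch of what a well-behaved crawler does before fetching a 
page. It uses the standard library's urllib.robotparser. The crawler name 
"ExampleAIBot", the host publisher.example, and the rules themselves are 
made up for the example; they are not anything Google has proposed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. "ExampleAIBot" is an
# invented crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # /robots.txt from the site first (set_url() followed by read()).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for path in ("/news/story.html", "/private/draft.html"):
            url = "https://publisher.example" + path
            print(agent, path, may_fetch(agent, url))
```

Note that the check runs entirely on the crawler's side. Nothing on the 
publisher's server enforces it, so a bot that never calls something like 
may_fetch() is not slowed down at all.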

Google has multiple large language models, including its recently announced 
PaLM 2. Google’s Bard chatbot was originally based on the LaMDA LLM, and 
researchers have noted that 50% of LaMDA’s training data comes from public 
forums, while a good chunk of the rest is scraped from Wikipedia and other 
websites.

It’s not just publishers that Google is looking to scrape; it’s the entire 
internet writ large.

Recently, Google updated its privacy policy to explicitly allow the company to 
use everything you post online in developing its AI tools. Shortly 
after Gizmodo was first to spot the policy change, Google was hit with a class 
action lawsuit claiming the company scraped up copyrighted material without 
consent.

ChatGPT creator OpenAI has been hit with a very similar lawsuit over its 
alleged abuse of copyright.

Essentially, these companies have already scraped up massive amounts of the 
internet to train their models. So much of the data is already based on 
Wikipedia entries and Reddit posts, but these models also make use of articles, 
books, and other online text.

Just consider that the GPT-4 language model is trained on 45 terabytes of data, 
so there’s a bounty of published material locked inside. OpenAI has its own 
designs on industry-friendly regulation, and it has called for a whole new 
federal agency meant to oversee the tech. Google, on the other hand, has 
lobbied against that proposal.

Google’s opt-out idea wouldn’t be localized to just Australia, of course. The 
company has been trying to court the largest news organizations like The New 
York Times and The Washington Post with new AI tools, all while trying to imply 
it’s A-OK if it scrapes up all those published articles for use in training its 
AI.

--




_______________________________________________
Link mailing list
[email protected]
https://mailman.anu.edu.au/mailman/listinfo/link
