Hi Tarunima

I've not delved too deep into this stuff, but I want to share some inputs
about a problem that might be a root cause of these changes. Apologies
in advance if this is under-informed:

Problem
Bandwidth consumption is a major cost factor in any situation where one
centralized distribution point is being queried for data by thousands to
millions of consumers.
This cost is recurring, and on top of it there's a slab-like fixed cost: to
be ABLE to serve data at huge volumes, you have to buy or rent some pretty
expensive infrastructure / services. And that cost applies regardless of
whether anyone actually consumes your data or not.

And that is just the bandwidth part. In cases where the data has to be
fetched out of a database, the database too has to bear really high loads,
on top of its main job of ingesting and maintaining the data. And things
get more complex and expensive as we dive into multi-cluster databases etc.

When people in the research community talk about data access, I don't see
anybody even mentioning this stuff. It's mostly "You have to open these
endpoints with zero restrictions and I will scrape them at max speed and
that's that!". They don't consider that there will be others like them.
Imagine all the cars in a city deciding to go to a mall at the same time -
that's what keeps happening here.

----------

Solutions

1. Torrents
Before the current state of high-speed internet access came about, there
was one solution widely used to get around this problem: BitTorrent.
Basically, anyone who is downloading some data can also become a relayer of
chunks of that data to other consumers. So the consumption load is spread
out.
This works for fixed static files that don't need to update over time, not
databases with changing data. Even today, if we go to download some Linux
distro, we'll see that they share .torrent / magnet links and recommend
downloading that way, which relieves the load on their servers, which are
mostly volunteer-funded.

This might not be suitable for dynamically updating data, but for data
dumps that have a fixed version / release date, why not? We should see
governments / institutions releasing torrents the same way Linux distros
do. And there are enough technical institutions and companies with hefty
servers in the country that can seed these torrents, just like how they're
seeding Linux distros today.

For dynamically updating data and database-type queries, there are web3
technologies coming up that might help with similar decentralised
distribution - I don't know more details but it's worth digging into.
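To make the chunk-relaying idea concrete: BitTorrent can safely accept
pieces of a file from total strangers because the publisher ships a hash
for every piece, and each downloader verifies pieces independently. Here's
a minimal illustrative sketch of that verification idea in Python (the
piece size and choice of SHA-1 follow common BitTorrent convention, but
this is my own toy example, not a real client):

```python
import hashlib

PIECE_SIZE = 256 * 1024  # fixed-size pieces, 256 KiB here

def piece_hashes(data: bytes) -> list:
    """Publisher side: split the file into pieces and hash each one.
    These hashes are what a .torrent file distributes."""
    return [
        hashlib.sha1(data[i:i + PIECE_SIZE]).hexdigest()
        for i in range(0, len(data), PIECE_SIZE)
    ]

def verify_piece(piece: bytes, expected_hash: str) -> bool:
    """Downloader side: a piece received from ANY peer can be checked
    against the publisher's hash, so no trust in the relayer is needed."""
    return hashlib.sha1(piece).hexdigest() == expected_hash
```

The point is that trust stays with the original publisher (who made the
hashes) even though bandwidth comes from volunteers.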

-----------
2. Consumer pays model
Another solution I've seen recently is a consumer-pays-for-egress model
used in Amazon's OpenStreetMap data release. Here's one link; I couldn't
find the exact article explaining it: https://registry.opendata.aws/
Under this model, the dataset is open and query-able, meaning you can fetch
just the parts you want. But if you as a consumer want to fetch a large
amount of it, then you have to pay the bandwidth costs incurred.
And if you consume less, there might be a free slab you come under (not
sure what the case is with AWS).

Provided we retain a basic free tier, I think this takes care of a lot of
problems. Now, I wouldn't want India's government to be a beholden AWS
customer (because the USA etc. already are, and there are national security
considerations that we SHOULD take seriously, not scoff at). Rather, like
what has been done with UPI, there should be Indian-owned infrastructure
and services that offer the same deal. Maybe a prepaid wallet that gets
deducted from when we exceed the free tier.
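The billing logic of such a free-slab model is simple. Here's a sketch of
how a prepaid-wallet deduction could work - the tier size and per-GB rate
are numbers I made up for illustration, not AWS's (or anyone's) actual
pricing:

```python
def egress_bill(gb_consumed: float,
                free_tier_gb: float = 100.0,
                rate_per_gb: float = 7.0) -> float:
    """Consumer-pays-for-egress: the first free_tier_gb per billing
    period costs nothing; beyond that, the consumer pays rate_per_gb
    (hypothetical figures) for every extra GB fetched."""
    billable_gb = max(0.0, gb_consumed - free_tier_gb)
    return billable_gb * rate_per_gb
```

A casual researcher fetching 50 GB pays nothing; a bulk scraper pulling
150 GB pays for the 50 GB above the free slab.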

---------
3. Rate limiting
Many times it's not the quantity but the velocity of data scraping that
inflicts high costs on the provider. If scrapers scraped data slowly
overnight instead of trying to fetch everything in 10 minutes at peak
business hours, we might make things work with the same existing setup used
to serve data to sites.
Example: the main concern for, say, Indian Railways in serving train
schedules data would be that the server should not get so clogged up by
these scraping bots that people who are trying to book tickets face
downtime. Enforcing rate limits can help a lot here. A basic figure: allow
an IP or a user to make at most 4 requests per minute. I recently
implemented this using Kong Gateway, and was surprised by how easy it was.
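In Kong this is just plugin configuration, but the underlying idea is a
simple counter per client per time window. A rough sketch of that
fixed-window logic (my own illustration of the general technique, not
Kong's actual implementation, which handles clustering, headers etc.):

```python
import time
from collections import defaultdict
from typing import Optional

LIMIT = 4        # max requests allowed...
WINDOW = 60.0    # ...per 60-second window, per client

# client id (e.g. IP or API key) -> [window_start_time, request_count]
_counters = defaultdict(lambda: [0.0, 0])

def allow_request(client_id: str, now: Optional[float] = None) -> bool:
    """Return True if the request is within the limit; a gateway would
    answer False with HTTP 429 (Too Many Requests)."""
    now = time.monotonic() if now is None else now
    window_start, count = _counters[client_id]
    if now - window_start >= WINDOW:
        _counters[client_id] = [now, 1]   # window expired: start fresh
        return True
    if count < LIMIT:
        _counters[client_id][1] = count + 1
        return True
    return False                          # over the limit for this window
```

The scraper still gets all the data eventually; it just can't take the
server down getting it.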

In many cases I suspect the people in charge had no idea that rate-limiting
is even possible, so they went for the next option: captcha restrictions to
disable automated data fetching entirely. Funnily, that's a far more
expensive measure than rate-limiting! And now we have an arms race, with
scrapers cracking the captcha and providers then making it so difficult to
read that eventually humans won't be able to read it anymore. Maybe we can
put down our weapons, take a few steps back and communicate that there are
options available that work for both sides?

--------
4. Load shifting
One load-shifting example I've seen in my netbanking: if I request some
long-term account statements, instead of trying to serve the data
immediately, the site queues the task in the backend and tells me to carry
on and come back in a few minutes to download. Some other sites mail me the
link when they're done gathering the requested data. Something like this
shifts server load from peak-time spikes to the "lazy" times later when
there isn't much traffic. The institution can stay on cheaper
infrastructure, doesn't incur higher costs, and still accomplishes the goal
of providing data.

--------

A mental note:
I came across a quote the other day: a government that is expected to do
everything for us will take everything from us.
Too often I see this expectational attitude amongst folks - "everything
must be provided, free and openly accessible!" - without taking into
consideration what it all takes, or the fact that they're not the only
scrapers on the planet. Or the fact that there will always be an unlimited
supply of idiots hogging all the resources for no use other than to show
off their latest Go code's concurrency stats.
I agree that we're paying taxes, but those taxes are already accounted for
(very inefficiently, but yes), and demanding that they be used to fund all
this new web infra for giving us more free stuff leads to the same
convenient outcome: an increase in our taxes. It's already happening, and
it's not sustainable. I especially don't appreciate having to pay more
taxes just for the sake of that idiot with the Go code :D.

And you can bet that if we demand that some government institution make
everything freely available without caring about the details, they will do
it in the most expensive and wasteful way imaginable. The solutions I've
written above seem a lot better to me than increasing my taxes and creating
yet another black hole in the govt budget.
I don't know exactly how we can make things work out, but if we ditched the
expectational attitude and thought more like team players, with the
government being part of that team, we could go a long way.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


On Fri, Jul 21, 2023 at 12:18 PM Tarunima <[email protected]> wrote:

> Hi All,
>
> There have been a number of recent changes globally and nationally, in
> part triggered by Gen AI chat bots, that could restrict scraping. Listing
> the salient ones here:
>
>    - India coming up with the Digital India Act which would bear on data
>    access
>    - EU releasing the Data Services Act which has provisions for
>    researcher access to data (
>    https://algorithmwatch.org/en/dsa-data-access-explained/)
>    - Twitter and Reddit revoking API access:
>    https://www.fastcompany.com/90904038/reddit-restricts-third-party-apps
>
> Does this group have any views/concerns about this?
>
> Apologies if this has already been addressed in another thread.
>
> Regards,
> Tarunima
>
>
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/datameet/CACj7W2mAOhV4-Qi0Y6g4BqWUFUXCWtxQpyd1xewRpxwafXEb0g%40mail.gmail.com
> <https://groups.google.com/d/msgid/datameet/CACj7W2mAOhV4-Qi0Y6g4BqWUFUXCWtxQpyd1xewRpxwafXEb0g%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
