Riccardo Mottola wrote:
> Everything started working again after a couple of hours.

Makes me think it was simply overloading.

> For info, my current IP ...

I do not find your address in any of the current bans.

> AI scrapers are becoming a plague much worse than crawlers. One reason more
> I dislike this AI hype.
> I am part of other Open Source projects and we all have more or less similar
> problems.
> It usually affects wikis, CVS/SVN/GIT browsers, bug trackers and such.

It's an asymmetrical problem.  Small projects are small.  They have
one public-facing web server.  They start off as an idea with no
resources.  They face a hundred million proxy bots hitting them from
well-funded AI start-up companies, none of which has posted a profit
so far but all of which are getting funded with a zillion dollars in
startup money.

> I was alarmed that SVN repos were down.

For svn note that there is also the svn:// protocol, served by
svnserve.  It's on a different port.  It might survive when the https
port is overwhelmed.  As you saw, the ssh:// protocol was also
available during the same time.

There are no mirrors available for the Subversion repositories.  Yet.
It's in the plan.  We just need server resources and then time to get
them set up.

Things continue to evolve.  Maybe we set up a new port secured with an
HTTP basic auth password, in order to have a known-good access service
for members?  Maybe.

The HTTP protocol has really opened up the Internet.  It seems that
everything is using it.  And therefore people have it in their minds
that it is the best way to do things.  But HTTP almost always means a
resource-heavy backend server behind it.

Let's say someone has a 1 GB RAM virtual machine for a server.  That's
a typical size for a lot of uses.  It used to be more than sufficient.
Let's say that a backend process uses 50 MB of memory to do something.
That's not very large these days.  Doing the math...  If I didn't make
some silly error...  That's only 20 backend server processes that can
run before they consume all memory.
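Spelling out that arithmetic as a small Python sketch.  The 1 GB and
50 MB figures are the illustrative numbers from above, with 1 GB taken
as 1024 MB.  The function name is made up for this example, and it
ignores the memory used by the kernel and everything else on the box,
so treat the result as an upper bound.

```python
def max_backend_processes(total_ram_mb: int, per_process_mb: int) -> int:
    """Return how many backend processes fit in RAM.

    Assumption: all RAM is available to the backends, which is
    optimistic for a real machine.
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return total_ram_mb // per_process_mb


# 1 GB virtual machine, 50 MB per backend process.
print(max_backend_processes(1024, 50))  # 20
```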
We must limit the number of backend processes that can be spawned to
20.  A botnet AI scraper hammers the machine with 500 queries per
second.  That will cause all 20 backend processes to be 100% busy.
Let's say the server has 16 cores.  Twenty busy processes are still
going to keep all of the CPU cores 100% busy, although in practice I/O
wait on storage will prevent that.  The machine load average will rise
to 20, plus a few more for other miscellaneous processes.  So maybe it
levels off at a load of 25.  And it will be completely maxed out.  It
will be operating at its capacity.  But the botnet, with millions of
proxy clients available, will have plenty more resources to keep the
system overwhelmed.

And even if we scaled that 1 GB to 256 GB, that only increases
linearly the number of bots that could be served.  It is not a big
enough scaling and the botnets would still overwhelm us.

Also remember that these AI scrapers are not doing svn checkouts.
They are web browsing the repository as if it were a web page,
downloading each of the hash-numbered object files indiscriminately.
This is a useless, wasted activity for them.  But they don't care.
They have seemingly endless resources right now.  They should be doing
an svn checkout and then scraping their own local sandbox.  But
instead, since the repository is available by https://, it is being
uselessly scraped, regardless of the robots.txt files saying not to do
that to these directories.

I propose this as an illustration that using HTTP for everything, and
treating everything as a web protocol, has the unintended consequence
of allowing the horde of scrapers to damage us.  If it were a
different dedicated protocol then this would not be happening at this
level of abuse.  The svn:// (svnserve) and git:// protocol services
have so far not been the main target of abuse.

Bob
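
P.S.  For anyone who wants to check the scaling claim, here is a rough
Python sketch.  It reuses the assumed 50 MB per backend process and
takes one million bot clients as a stand-in for "millions of proxy
clients".  Both the names and the one million figure are assumptions
for illustration, not measurements.

```python
PER_PROCESS_MB = 50            # assumed backend process size, as above
ASSUMED_BOT_CLIENTS = 1_000_000  # hypothetical stand-in for "millions"


def backend_slots(ram_gb: int, per_process_mb: int = PER_PROCESS_MB) -> int:
    """Backend processes that fit in ram_gb of RAM (1 GB taken as 1024 MB)."""
    return (ram_gb * 1024) // per_process_mb


def clients_per_slot(ram_gb: int, clients: int = ASSUMED_BOT_CLIENTS) -> int:
    """How many bot clients compete for each backend process."""
    return clients // backend_slots(ram_gb)


for ram_gb in (1, 256):
    print(f"{ram_gb} GB: {backend_slots(ram_gb)} backends, "
          f"{clients_per_slot(ram_gb)} bot clients per backend")
```

Going from 1 GB to 256 GB multiplies the number of backends by a few
hundred, but under these assumptions there are still well over a
hundred bot clients waiting on every backend process.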
