Roi Martin wrote:
> First of all, sorry for the late reply.  Some personal stuff has kept me
> away from the computer for quite some time.

No worries!  I am also juggling life, the universe, and everything.
We are all getting by as best as we can.  For example I am writing
this on a train while out on travel right now.

> However, with regard to "https://cgit.git.savannah.gnu.org/cgit/" and
> "https://gitweb.git.savannah.gnu.org/gitweb/", you said that these URLs
> are very likely to change and should not be used in documentation.
>
> Bob Proulx writes:
>   >     https://cgit.git.savannah.gnu.org//cgit/
>   >     https://gitweb.git.savannah.gnu.org//gitweb/
>   >
>   > Those names and paths might need to change but for the moment maybe
>   > having them even temporarily will give some relief?  I would avoid
>   > using them in documentation which will persist because again they are
>   > likely not stable naming yet.  You have been warned.

Yes.  I did say that.  But it has been a year now and I think things
have solidified.  In particular the above convention helps with the
many combinations of redirects that are needed.  For example, the
following rule is rather programmatic and regular.

    location /git { return 302 $scheme://$scheme.$server_name$request_uri; }
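
To illustrate how regular that is, here is a minimal Python sketch of
the mapping.  Only the /git rule comes from the configuration line
above.  The /cgit and /gitweb entries are assumptions inferred from
the Location headers in the curl output quoted in this message, and
the names RULES and redirect_target are made up for this example.
The longest matching prefix wins, which is how nginx chooses among
prefix locations, so /gitweb is not swallowed by /git.

```python
# (path prefix, subdomain label).  None means "reuse the request scheme",
# which mirrors $scheme://$scheme.$server_name$request_uri above.
RULES = [
    ("/git", None),          # from the nginx rule shown above
    ("/cgit", "cgit"),       # assumption, inferred from observed redirects
    ("/gitweb", "gitweb"),   # assumption, inferred from observed redirects
]


def redirect_target(scheme, server_name, request_uri):
    """Return the redirect URL for a request, or None if no rule matches."""
    matches = [(prefix, label) for prefix, label in RULES
               if request_uri.startswith(prefix)]
    if not matches:
        return None
    # Pick the longest matching prefix, as nginx does for prefix locations.
    _prefix, label = max(matches, key=lambda m: len(m[0]))
    return f"{scheme}://{label or scheme}.{server_name}{request_uri}"


for uri in ("/git/", "/cgit/", "/gitweb/"):
    print(redirect_target("https", "git.savannah.gnu.org", uri))
```

The same function covers http requests too, since the scheme is just
carried through into the host name.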

> Now that all redirects seem to be implemented:
>
>   $ curl -isS https://git.savannah.gnu.org/git/ | grep Location
>   Location: https://https.git.savannah.gnu.org/git/
>   $ curl -isS https://git.savannah.gnu.org/cgit/ | grep Location
>   Location: https://cgit.git.savannah.gnu.org/cgit/
>   $ curl -isS https://git.savannah.gnu.org/gitweb/ | grep Location
>   Location: https://gitweb.git.savannah.gnu.org/gitweb/
>
> What are the recommended URLs to use in documentation?
> "https://{https,cgit,gitweb}.git.savannah.gnu.org" or
> "https://git.savannah.gnu.org"?

That is a good question.  I know people prefer shorter ones (look at
the shorter debbugs URLs for an example) and the redirects will then
go to the best place.  But that does require the primary to be online
and available and fails if it is not.  Whereas the longer URLs to the
mirror pool will work independently of the primary.
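
For tooling that wants the short URL but also wants to survive a
primary outage, one option is to try both in order.  This is only a
sketch under stated assumptions: first_reachable and primary_down are
hypothetical names, and the probe here is a stand-in that simulates
an outage rather than doing any real network access.

```python
from typing import Callable, Iterable, Optional


def first_reachable(urls: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which probe(url) is true, else None."""
    for url in urls:
        if probe(url):
            return url
    return None


def primary_down(url: str) -> bool:
    """Simulated outage: only the mirror pool host name answers."""
    return "//cgit.git.savannah.gnu.org/" in url


candidates = [
    "https://git.savannah.gnu.org/cgit/",       # short, needs the primary
    "https://cgit.git.savannah.gnu.org/cgit/",  # long, mirror pool directly
]
chosen = first_reachable(candidates, primary_down)
print(chosen)
```

A real probe would make an HTTP request with a timeout.  That part is
deliberately left out here.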

With the redirects in place the load on the primary has been reduced
hugely.  Basically the load average went from 40+ with all 8 CPUs at
100% down to an average of 2 with idle CPU available.  And so perhaps
I no longer feel strongly about it one way or the other until and
unless some new problem arises.

Do you have an opinion?  What do you think?

> While I was using curl to check the previous redirects,
> git.savannah.gnu.org stopped answering my requests.  However, it worked
> again when tethering from my phone.  Out of curiosity, do you
> temporarily blacklist hosts showing a suspicious behaviour?  For
> instance, using curl instead of a browser or git to access these URLs?
> Maybe this is what happened last time?

In the battle against the AI scrapers taking the site offline we are
blocklisting addresses very dynamically.  We are blocking at multiple
levels and layers.  The site router has blocks, mostly blocking entire
ASN network blocks.  The individual systems in the primary pool of
core servers block on each machine.  (These are currently not
coordinated but likely will be in the future.)  The mirror pool
systems have their own firewalls completely independent of the primary
systems.  So there are a lot of specific things that apply to one
machine separately from the others.

For obvious reasons no one wants to discuss publicly specific details
of how the dynamic blocking rules are being created and implemented.
No one wants to give a scraper vendor clues about how to blast through
their defenses.  But what I can say is that the problem scrapers do
not obey the robots.txt file, ignoring it completely, and falsify their
User-Agent strings to appear as much as they can like user web
browsers.  This makes it tricky to separate the humans we want to have
access from the scrapers which should simply clone the repositories
and have full access to the free software source code that way rather
than hammering on a site until it goes offline.

So that's the long answer.  The shorter answer is yes.  Yes, your
address might have been blocked by tripping one of the rules.  If you
tell us your address (you can send it privately to the private
[email protected] list) then we can look to see if so
and why.  And we would need to know which system you were connecting
to as well, since as described they are both independent and coupled
and we would need to look in both places.

Bob
