Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-12-03 Thread JonGeorg SageLibrary via Evergreen-general
The DorkBot queries I'm referring to look like this:
[02/Dec/2021:12:08:13 -0800] "GET
/eg/opac/results?do_basket_action=Go&query=1&detail_record_view=1&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword%27%22&fg%3Amat_format=1&locg=176&sort=1
HTTP/1.0" 200 62417 "-" "UT-Dorkbot/1.0"

They vary after "metabib", but all of them use the basket feature, and
they come from different library branch URLs.
-Jon
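
Since the bot announces itself with a distinctive User-Agent string
("UT-Dorkbot/1.0"), one option besides rate limiting is to refuse that
agent outright at the nginx proxy before requests ever reach Apache. A
minimal sketch, assuming nginx fronts the OPAC; the variable name is
illustrative, not from this thread:

    # http context: flag self-identified DorkBot requests
    map $http_user_agent $block_bot {
        default       0;
        ~*UT-Dorkbot  1;
    }

    # inside the existing "location /" block, before the proxy_pass:
    if ($block_bot) {
        return 403;
    }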

Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-12-03 Thread JonGeorg SageLibrary via Evergreen-general
Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs.
Is DorkBot used legitimately for querying the OPAC?
-Jon

Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-12-03 Thread JonGeorg SageLibrary via Evergreen-general
Thank you!
-Jon

Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-12-03 Thread Blake Henderson via Evergreen-general

JonGeorg,

This reminds me of a similar issue that we had. We resolved it with
this change to NGINX. Here's the link:


https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits

and the bug:
https://bugs.launchpad.net/evergreen/+bug/1913610

I'm not sure that it's the same issue though, as you've shared a search
SQL query and this solution addresses external requests to
"/opac/extras/unapi". But you might be able to apply the same nginx
rate-limiting technique here if you can detect the URL they are using.
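
For reference, the branch above builds on nginx's limit_req machinery.
A minimal sketch of the same idea pointed at catalog search URLs
instead (the zone name, rate, and burst values are illustrative, not
taken from the branch):

    # http context: track clients by IP, allow ~2 search requests/second
    limit_req_zone $binary_remote_addr zone=opac_search:10m rate=2r/s;

    # server context: apply the limit only to search URLs
    location /eg/opac/results {
        limit_req zone=opac_search burst=5 nodelay;
        # the existing proxy_pass to Apache stays here
    }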


There is a tool called "apachetop" which I used to see the URLs that
were being requested.


apt-get -y install apachetop && apachetop -f 
/var/log/apache2/other_vhosts_access.log


and another useful command:

cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | 
uniq -c | sort -rn
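
A variation on that one-liner that tallies by User-Agent rather than by
URL can confirm which bots dominate. With the default vhost_combined
LogFormat the agent is the sixth quote-delimited field; adjust if your
format differs:

    awk -F'"' '{print $6}' /var/log/apache2/other_vhosts_access.log |
        sort | uniq -c | sort -rn | head -20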


You have to ignore (not limit) all the requests to the Evergreen gateway 
as most of that traffic is the staff client and should (probably) not be 
limited.


I'm just throwing some ideas out there for you. Good luck!

-Blake-
Conducting Magic
Can consume data in any format
MOBIUS

On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
I tried that and still got the loopback address, after restarting 
services. Any other ideas? And the robots.txt file seems to be doing 
nothing, which is not much of a surprise. I've reached out to the 
people who host our network and have control of everything on the 
other side of the firewall.

-Jon


On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson  wrote:

JonGeorg,

If you're using nginx as a proxy, that may be a result of how Apache
and nginx are configured.

First, make sure that mod_remoteip is installed and enabled for
Apache 2.

Then, in eg_vhost.conf, find the 3 lines that begin with
"RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.

Next, see what header Apache checks for the remote IP address. In my
example it is "RemoteIPHeader X-Forwarded-For".
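
Taken together, the Apache side of that might look like this once
uncommented (a sketch; the proxy range must match wherever nginx
actually runs):

    # eg_vhost.conf -- trust the X-Forwarded-For header set by nginx
    RemoteIPHeader X-Forwarded-For
    RemoteIPInternalProxy 127.0.0.1/24

    # enable the module and reload (Debian/Ubuntu)
    sudo a2enmod remoteip
    sudo systemctl reload apache2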

Next, make sure that the following two lines appear in BOTH
"location /" blocks in the nginx configuration:

         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
         proxy_set_header X-Forwarded-Proto $scheme;

After reloading/restarting nginx and Apache, you should start seeing
remote IP addresses in the Apache logs.

Hope that helps!
Jason


On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
> Because we're behind a firewall, all the addresses display as 127.0.0.1.
> I can talk to the people who administer the firewall though about
> blocking IPs. Thanks
> -Jon
>
> On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
> wrote:
>
>     JonGeorg,
>
>     Check your Apache logs for the source IP addresses. If you can't
>     find them, I can share the correct configuration for Apache with
>     Nginx so that you will get the addresses logged.
>
>     Once you know the IP address ranges, block them. If you have a
>     firewall, I suggest you block them there. If not, you can block
>     them in Nginx or in your load balancer configuration if you have
>     one and it allows that.
>
>     You may think you want your catalog to show up in search engines,
>     but bad bots will lie about who they are. All you can do with
>     misbehaving bots is to block them.
>
>     HtH,
>     Jason
>
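
Concretely, that blocking advice could take either of these forms; the
192.0.2.0/24 range below is a documentation placeholder, not an address
identified in this thread:

    # at the host firewall, one rule per offending range from the logs:
    sudo iptables -I INPUT -s 192.0.2.0/24 -j DROP

    # or at the nginx layer, inside the relevant server block:
    deny 192.0.2.0/24;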
>     On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>      > Question. We've been getting hammered by search engine bots [?],
>      > but they seem to all query our system at the same time. Enough
>      > that it's crashing the app servers. We have a robots.txt file in
>      > place. I've increased the crawling delay from 3 to 10 seconds,
>      > and have explicitly disallowed the specific bots, but I've seen
>      > no change from the worst offenders - Bingbot and UT-Dorkbot. We
>      > had over 4k hits from Dorkbot alone from 2pm-5pm today, and over
>      > 5k from Bingbot in the same timeframe. All a couple hours after
>      > I made the changes to the robots file and restarted apache
>      > services. Which out of 100k entries in the vhosts files in that
>      > time frame doesn't sound like a lot, but the rest of the traffic
>      > looks normal. This issue has been happening intermittently
>      > [last 3 are 11/30, 11/3, 7/20] for a while, and the only thing
>      > that seems to work is to manually kill the services on the DB
>      > servers and restart services on the application servers.
>      >
>      > The sympto
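
For completeness, the robots.txt changes described in that original
message would look something like the sketch below. Crawl-delay is
honored by Bingbot but is not part of the core robots.txt standard, and
a misbehaving crawler like UT-Dorkbot can ignore robots.txt entirely,
which matches the behavior reported in this thread:

    # robots.txt -- sketch of the directives described above
    User-agent: bingbot
    Crawl-delay: 10

    User-agent: UT-Dorkbot
    Disallow: /

    User-agent: *
    Crawl-delay: 10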