Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-11-30 Thread JonGeorg SageLibrary via Evergreen-general
Because we're behind a firewall, all the addresses display as 127.0.0.1. I
can talk to the people who administer the firewall, though, about blocking
IPs. Thanks
-Jon
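[Editor's note: the 127.0.0.1 symptom usually means Apache is logging the address of the proxy in front of it rather than the client. A minimal sketch of one common fix, assuming Nginx proxies to a local Apache; the directives are standard Nginx and mod_remoteip, but treat the exact values as placeholders for your setup:]

```
# Nginx side: pass the real client address to the backend
proxy_set_header X-Real-IP $remote_addr;

# Apache side (requires mod_remoteip): trust that header from the local proxy
RemoteIPHeader X-Real-IP
RemoteIPInternalProxy 127.0.0.1
# then log %a (the client address) in your LogFormat
```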

On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general <
evergreen-general@list.evergreen-ils.org> wrote:

> JonGeorg,
>
> Check your Apache logs for the source IP addresses. If you can't find
> them, I can share the correct configuration for Apache with Nginx so
> that you will get the addresses logged.
>
> Once you know the IP address ranges, block them. If you have a firewall,
> I suggest you block them there. If not, you can block them in Nginx or
> in your load balancer configuration if you have one and it allows that.
>
> You may think you want your catalog to show up in search engines, but
> bad bots will lie about who they are. All you can do with misbehaving
> bots is to block them.
>
> HtH,
> Jason
>
___
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general


Re: [Evergreen-general] Question about search engine bots & DB CPU spikes

2021-11-30 Thread Jason Stephenson via Evergreen-general

JonGeorg,

Check your Apache logs for the source IP addresses. If you can't find 
them, I can share the correct configuration for Apache with Nginx so 
that you will get the addresses logged.
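[Editor's note: once the addresses are being logged, a quick way to see which ones are responsible is a sketch like the following; the log path and Apache's common/combined format (client address in field 1) are assumptions about the local setup:]

```shell
# count requests per client IP in an Apache access log, busiest first;
# field 1 is the client address in the common/combined log formats
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head
```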


Once you know the IP address ranges, block them. If you have a firewall, 
I suggest you block them there. If not, you can block them in Nginx or 
in your load balancer configuration if you have one and it allows that.
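[Editor's note: for the Nginx option, blocking a range can look like the sketch below; the ranges are RFC 5737 documentation placeholders, not real bot ranges, so substitute whatever your logs actually show:]

```
# drop requests from misbehaving crawler ranges before they reach Apache
deny 203.0.113.0/24;    # placeholder: first offending range from the logs
deny 198.51.100.0/24;   # placeholder: second offending range
allow all;              # everything else is still served
```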


You may think you want your catalog to show up in search engines, but 
bad bots will lie about who they are. All you can do with misbehaving 
bots is to block them.


HtH,
Jason



___
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general


[Evergreen-general] Question about search engine bots & DB CPU spikes

2021-11-30 Thread JonGeorg SageLibrary via Evergreen-general
Question. We've been getting hammered by search engine bots, and they all
seem to query our system at the same time. Enough that it's crashing the
app servers. We have a robots.txt file in place. I've increased the crawl
delay from 3 to 10 seconds and have explicitly disallowed the specific
bots, but I've seen no change from the worst offenders, Bingbot and
UT-Dorkbot. We had over 4k hits from Dorkbot alone from 2pm-5pm today, and
over 5k from Bingbot in the same timeframe, all a couple of hours after I
made the changes to the robots file and restarted the Apache services. Out
of 100k entries in the vhost logs in that time frame that doesn't sound
like a lot, but the rest of the traffic looks normal. This issue has been
happening intermittently [the last 3 occurrences were 11/30, 11/3, and
7/20] for a while, and the only thing that seems to work is to manually
kill the services on the DB servers and restart services on the
application servers.
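[Editor's note: the robots.txt changes described above would look something like this sketch. Note that Crawl-delay is advisory and non-standard, and a bot that ignores a Disallow rule will ignore the delay too, which matches the behavior reported here:]

```
# sketch of the robots.txt described above: longer delay, plus
# explicit disallows for the worst offenders (advisory only)
User-agent: bingbot
Crawl-delay: 10

User-agent: UT-Dorkbot
Disallow: /
```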

The symptom is an immediate spike in database CPU load. I start killing
all queries older than 2 minutes, but that still usually overwhelms the
system, causing the app servers to stop serving requests. The stuck
queries are almost always along the lines of:

-- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
-- from_metarecord(*BIB_RECORD#*) core_limit(10) badge_orgs(1,138,151)
-- estimation_strategy(inclusion) skip_check(0) check_limit(1000) sort(1)
-- filter_group_entry(1) 1 site(*LIBRARY_BRANCH*) depth(2)
WITH w AS (
  WITH *STRING*_keyword_xq AS (SELECT
    (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
       *LONG_STRING*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
     to_tsquery('simple', COALESCE(NULLIF( '(' ||
       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
       *LONG_STRING*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
    (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
       btrim(regexp_replace(split_date_range(search_normalize
 00:02:17.319491 | *STRING* |
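[Editor's note: the manual query-killing step described above can be scripted on the database side. A sketch, assuming PostgreSQL (which Evergreen uses) and superuser or pg_signal_backend rights; on PostgreSQL 9.2+ the column is pid:]

```sql
-- terminate active queries that have been running longer than 2 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
  AND pid <> pg_backend_pid();
```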

And UT-Dorkbot's requests look like they could be what triggers the query,
since they use the basket function in the OPAC.

"GET /eg/opac/results?do_basket_action=Go=1_record_view=
*LONG_STRING*=Search_highlight=1=metabib_basket_action=1=keyword%3Amat_format=1=112=1
HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"

I've anonymized the output just to be cautious. Reports are run off the
backup database server, so it can't be an auto-generated report, and it
doesn't happen often enough for that either. At this point I'm tempted to
block the IP addresses. What strategies are you all using to deal with
crawlers, and does anyone have an idea what is causing this?
-Jon
___
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general


Re: [Evergreen-general] Holds List

2021-11-30 Thread Brandi Sanders via Evergreen-general
Never mind, I should have read my email first. LOL!

-Original Message-
From: "Brandi Sanders via Evergreen-general"
Sent: Tuesday, November 30, 2021 9:40am
To: "Evergreen General Listserve"
Subject: [Evergreen-general] Holds List

Is anyone else having a hard time getting the holds list to come up today?
 
Brandi J. Sanders
Bookmobile Director
Perry County Library
2328 Tell St., Tell City, IN 47586
812-547-2661
www.tcpclibrary.org

 



 ___
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general


[Evergreen-general] Holds List

2021-11-30 Thread Brandi Sanders via Evergreen-general
Is anyone else having a hard time getting the holds list to come up today?
 

Brandi J. Sanders
Bookmobile Director
Perry County Library
2328 Tell St., Tell City, IN 47586
812-547-2661
www.tcpclibrary.org

 
___
Evergreen-general mailing list
Evergreen-general@list.evergreen-ils.org
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general