Our investigation found that the Bytespider bot was causing the excessive CPU usage. You can confirm this by examining the traffic and logs on the firewall.
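As a rough sketch of how the web server logs could be checked, the Python below counts requests whose user agent contains "Bytespider" and groups the client IPs by their first two octets. It assumes the access log is in the usual Apache/Nginx combined format, where the user agent is the last quoted field. The function name `summarize_bot_traffic` and the sample log lines are made up for illustration, so adjust the pattern if your log format differs.

```python
import re
from collections import Counter

# Assumption: combined log format, i.e.
#   ip - - [time] "request" status size "referer" "user-agent"
# The client IP is the first field and the user agent is the last quoted field.
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def summarize_bot_traffic(lines, agent_substring="Bytespider"):
    """Count matching requests per client IP and per two-octet prefix."""
    per_ip = Counter()
    per_prefix = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ip, user_agent = match.groups()
        if agent_substring.lower() not in user_agent.lower():
            continue
        per_ip[ip] += 1
        octets = ip.split(".")
        if len(octets) == 4:  # only group IPv4 addresses by prefix
            per_prefix[".".join(octets[:2])] += 1
    return per_ip, per_prefix


# Hypothetical sample lines, not taken from a real server.
sample = [
    '47.128.1.2 - - [23/Sep/2024:15:06:20 +0300] "GET /handle/1 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '47.128.9.9 - - [23/Sep/2024:15:06:21 +0300] "GET /handle/2 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '203.0.113.5 - - [23/Sep/2024:15:06:22 +0300] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

per_ip, per_prefix = summarize_bot_traffic(sample)
for prefix, count in per_prefix.most_common():
    print(prefix, count)
```

On a real server you would pass the open log file instead of `sample`, for example `summarize_bot_traffic(open(path))` with `path` pointing at your own access log. The prefixes with the highest counts are the candidates to review before adding firewall rules.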
The Bytespider bot (spider-feedback(at)bytedance.com) is very aggressive. It appears to make requests constantly from China and Singapore. Moreover, it does not appear to honor robots.txt and seems to hit the DSpace server every second of every day. ByteDance also seems to use a different IP every second, so we need to block all of their IPs at the firewall. Also, examine the Apache/Nginx logs. You can start by trying to block the IP ranges beginning with 47.128.

SelenSoft Consulting
Turkey

On Monday 23 September 2024 at 15:06:20 UTC+3 Marijka Azzopardi wrote:

> Hi all,
>
> I'd like to ask about your experience in *managing repository server performance and stability issues*, particularly as a result of high bot traffic to your DSpace repository.
>
> To provide some background, at UNSW Sydney, our DSpace7 repository has been experiencing an increase in performance and stability issues due to the heavy load being placed on our repository from several contributing factors, including increased bot and crawler traffic.
>
> We plan to upgrade from DSpace v7.0 to v7.6.2 to optimise our server performance, and access vital bug fixes and functionality that addresses this, such as Caching of server-side rendered pages <https://wiki.lyrasis.org/display/DSDOC7x/Performance+Tuning+DSpace#PerformanceTuningDSpace-Turnon(orincrease)cachingofServer-SideRenderedpages>.
>
> I am aware though that performance and scalability issues are still being reported by DSpace v7.6.2 and v8 repository owners <https://github.com/DSpace/dspace-angular/issues/3110> and, as a result, solutions are being pushed for prioritisation in a future DSpace 9 release (tentative Apr 2025) <https://wiki.lyrasis.org/display/DSPACE/DSpace+Release+9.0+Status#:~:text=Improving%20performance%20/%20scalability,angular/issues/3163>.
>
> In the interim, although this may have reduced impact, we’re looking to update our robots.txt “disallow” rules to prevent bot crawling of unnecessary repository pages and reduce the number of requests made to our server by directing ‘compliant’ search engine crawlers directly to repository metadata and files.
>
> I would be very interested to know how your institution may be managing server performance and stability issues, if you have updated your robots.txt to direct crawler traffic and block bots, or have implemented any other solutions, e.g. integration with third-party software such as Redis <https://redis.io/> (used by Jagiellonian University <https://github.com/DSpace/dspace-angular/issues/3110#:~:text=There%20is%20the%20solution%20we%20used%20to%20replace%20Server%2DSide%20rendered%20pages%20with%20Redis%20in%20the%20case%20study%20on%20the%20Jagiellonian%20University%20Repository%3A%0AReplace%20the%20build%2Din%20Server%2DSide%20Rendered%20pages%20with%20Redis.pdf>) to cache server-side rendered pages in DSpace, or Cloudflare <https://www.cloudflare.com/en-au/> (Content Delivery Network), etc.
>
> Looking forward to hearing from you, and thanks in advance for your time!
>
> Kind regards,
>
> *Marijka Azzopardi*
>
> Repository Librarian
>
> Scholarly Communications & Repositories, UNSW Library
> UNSW SYDNEY 2052
