>> In package transfer or any networking of the repo server? Is the
>> increase caused by malicious bots crawling our site or more users often
>> updating their systems?
>>
>> Having specific data on this would show if encouraging using different
>> mirrors can solve the problem.
>
> we anonymize ips on logs so user agents may give you an idea
>
> block the bots!

My script for processing the log is at [0]. The log file I used starts
and ends with these lines:

127.0.0.1 - - [04/Mar/2012:03:45:11 +0000] "GET /isos/i686/parabola-2011.09.01-core-i686.iso HTTP/1.0" 206 45260 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-"
[...]
127.0.0.1 - - [05/Mar/2012:17:21:39 +0000] "GET /~lukeshu/os/x86_64/~lukeshu.db HTTP/1.1" 304 0 "-" "pacman/4.0.1 (Linux x86_64) libalpm/7.0.1" "-"

This is the output of my script on this log:

1107005 /skins Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/532.4 (KHTML, like Gecko) Qt/4.6.3 Safari/532.4
1336577 REPO Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20120207 Iceweasel/10.0
1787250 /pool Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
1911287 /pool Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
1924266 REPO Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
1938651 /index.php?title=Special:RecentChanges&feed=atom Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120219 Thunderbird/10.0.2
2468956 /docs Wget
2651128 REPO SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
2662872 /other Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
3568939 /index.php?title=Special:RecentChanges&feed=atom Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=17208844250925112595)
3840112 REPO curl
3976814 REPO Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
4384389 /sources Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
6889936 /other curl
11755784 /other DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
11968356 REPO Axel 2.4 (Linux)
16766866 /isos DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
18488177 /other SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
19950376 /isos Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
64949057 /isos SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
77276990 REPO Wget
123817802 REPO PackageKit
173015040 /isos Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
184574925 /isos Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.27) Gecko/20120216 Firefox/3.6.27
189289071 REPO Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])
215261516 REPO aria2/1.14.2
272698752 /isos Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a2) Gecko/20120304 Firefox/12.0a2
277916052 /isos Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
278714862 /isos Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
280486126 /isos Mozilla/5.0 (X11; Linux i686; rv:10.0.2) Gecko/20100101 Firefox/10.0.2 Iceweasel/10.0.2
545259520 /isos Wget
650556670 REPO Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
1241702558 REPO Python-urllib
14725983912 REPO pacman

All bytes: 19451591920, bot bytes: 1153843961 (5.93%).

A trivial modification of the script sums /isos as 2115913496 (10.88%).

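For readers who don't want to open [0], here is a minimal sketch of how such per-agent byte counts and percentages could be computed. It is not the script at [0]: the regular expression, the `BOT_MARKERS` list, and all function names are assumptions made for illustration, and it groups by the first path component only, without reproducing the REPO category or the version-stripping of user agents seen in the output above.

```python
import re
from collections import Counter

# Assumed layout: combined log format, as in the two sample lines above.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical markers for "honest" bots; agents that claim to be
# ordinary browsers are not detected by this check.
BOT_MARKERS = ("bot", "spider", "slurp", "feedfetcher", "ezooms")


def top_level(path):
    """Return the first path component, e.g. '/isos' for '/isos/i686/x.iso'."""
    path = path.split("?", 1)[0]
    return "/" + path.lstrip("/").split("/", 1)[0]


def count_bytes(lines):
    """Sum response sizes per (top-level path, user agent) pair."""
    totals = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like access-log entries
        size = 0 if m.group("size") == "-" else int(m.group("size"))
        totals[(top_level(m.group("path")), m.group("agent"))] += size
    return totals


def is_bot(agent):
    """True if the user agent contains one of the assumed bot markers."""
    lowered = agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


def summarize(totals):
    """Return (all bytes, bytes served to self-declared bots)."""
    all_bytes = sum(totals.values())
    bot_bytes = sum(n for (_, agent), n in totals.items() if is_bot(agent))
    return all_bytes, bot_bytes


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two places."""
    return round(100 * part / whole, 2)
```

With the totals reported above, `percent(1153843961, 19451591920)` gives 5.93 and `percent(2115913496, 19451591920)` gives 10.88, matching the figures from the script.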
Things not shown by the script:
- lines with small data sizes: showing them would be too much output,
  and it's mostly useless since the script lists each wiki article on a
  separate line
- the bot sum includes only honest bots, i.e. those which don't claim
  to be MSIE or other browsers

Things not logged:
- accesses outside March 4 and 5; I assume these dates are typical
- data other than the HTTP response size

My recommendations:
- add a /robots.txt file blocking all bots from everything on
  repo.parabolagnulinux.org
- remove the ISO images, unless there are users who cannot use torrents
  or other mirrors
- block bots that don't respect robots.txt by user agent, if future
  logs show them causing significant traffic
- promote using other mirrors

Unlike other sites, repo.parabolagnulinux.org doesn't need to be indexed
by search engines, so there should be no problem with blocking bots.

[0] https://mtjm.eu/patches/log_counter.py
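
As an illustration of the first recommendation, a /robots.txt that blocks all bots from the whole site could be tested with Python's standard `urllib.robotparser` before deploying it. This is only a sketch: the `allowed` helper and the example URL path are hypothetical, and the check says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Proposed file contents: every user agent is denied every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URL on the repo server, used only for the check.
    print(allowed("Baiduspider", "http://repo.parabolagnulinux.org/isos/"))
```

Clients such as pacman or Wget fetching single files do not consult robots.txt, so this rule would not affect ordinary package downloads.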
_______________________________________________
Dev mailing list
[email protected]
http://lists.parabolagnulinux.org/mailman/listinfo/dev
