I recently observed a "Nutch" bot crawling one of the sites I help maintain, www.rhiana-griffith.com, but noticed some oddities in the way it was crawling. It kept attempting to call up nonexistent pages that have, in the entire history of the website, /never/ existed. In essence, the bot was attempting to find hidden content by jumbling different variables together.

For example:

   143.248.92.150 - - [28/Jul/2011:02:43:11 -0700] "GET
   /board/Themes/RGFC/images/powered-php.gif;board=21.0 HTTP/1.0" 404
   2786 "-" "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

Everything in that address is actually valid until it adds ";board=21.0" at the end. It then repeated that add-on several hours later:

   143.248.92.150 - - [28/Jul/2011:08:06:58 -0700] "GET
   /board/Themes/RGFC/images/valid-css.gif;board=21.0 HTTP/1.0" 404
   2786 "-" "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

Again, the addition of ";board=21.0" makes no sense in any context; adding that would never work and never has worked. Several hours later it did this:

   143.248.92.150 - - [28/Jul/2011:13:54:55 -0700] "GET
   /fiction/index.php;board=32.0 HTTP/1.0" 404 2786 "-"
   "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

The fiction system runs on entirely different software from the message board system and has no boards at all, much less the ";board=32.0" that the bot attempts to append. Again, this is a link that never existed and never could have been indexed; the bot generated it and then tried to crawl it. A few hours later it did this:

   143.248.92.150 - - [28/Jul/2011:17:01:28 -0700] "GET
   /board/Themes/RGFC/images/valid-css.gif;u=237 HTTP/1.0" 404 2786 "-"
   "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

"u=237" is particularly interesting to me because it refers to a user on our board whose Hotmail account has either been stolen or is being heavily spoofed: many of us have recently received spam purporting to originate from that address. Only logged-in users would be able to see her email address, however, so this raised a large red flag for me. Again, though, tagging the URL with ";u=237" generated a 404 error because such an address has never existed.

An hour and a half later, it generated this error:

   143.248.92.150 - - [28/Jul/2011:18:34:35 -0700] "GET /index.php;u=8
   HTTP/1.0" 404 2786 "-" "Nutch/Nutch-1.0 (academic purpose;
   cats.kaist.ac.kr; [email protected])"

"u=8" is an interesting choice there because, on our message board, that is the user number of one of our most active admins. It's definitely an attempt to spoof board data, because our fiction system designates users by "uid=#" rather than just "u=#". But again, adding that to our main index page generates an error because such a URL has never, ever existed. Almost an hour later it did this:

   143.248.92.150 - - [28/Jul/2011:19:13:19 -0700] "GET
   /board/index.php;board=11.0?action=reminder HTTP/1.0" 404 2786 "-"
   "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

Again, it attempted to apply a nonexistent function to an existing URL. "?action=reminder" is not a legitimate function even for logged-in and admin-level users, much less a random bot cruising our pages. It would have had no reason ever to find such a link on our pages, so it must be manufacturing it to see whether something might be there. Soon after, it went to another part of our site and made another manufactured link:

   143.248.92.150 - - [28/Jul/2011:19:38:23 -0700] "GET
   /links/index.php;board=11.0 HTTP/1.0" 404 2786 "-" "Nutch/Nutch-1.0
   (academic purpose; cats.kaist.ac.kr; [email protected])"

The /links/ directory is controlled by a Coppermine photo gallery and, thus, has no boards. A few hours later it returned to another of our directories, also controlled by a Coppermine gallery, and attempted to pull up a nonexistent graphic with a nonexistent board designation:

   143.248.92.150 - - [28/Jul/2011:22:06:50 -0700] "GET
   /board/Themes/RGFC/images/english/calendar1.png;board=21.0 HTTP/1.0"
   404 2786 "-" "Nutch/Nutch-1.0 (academic purpose; cats.kaist.ac.kr;
   [email protected])"

This bot made a number of legitimate page inquiries as well during its visits to the site, but these particular inquiries are very disturbing because they demonstrate a pattern of attempting to discover non-indexed content via the application of random variables. That kind of approach is the trademark of hackers and malware, not legitimate bots.

I would like some reassurance that this behavior is NOT generated by your Nutch program. I'm considering banning the IP address of this bot, but I'm also tempted to ban the Nutch bot altogether at the .htaccess level, and blacklist it on all of the sites I'm involved with, unless I receive reassurance that this sort of behavior is NOT standard for your software. The site in question has been attacked by hackers in the past, so I take all suspicious intrusions very seriously and hope you do, as well.
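
In case it helps anyone reproduce this from their own logs, here is a minimal Python sketch of the filter I have in mind: it picks out requests whose user agent mentions Nutch, whose path contains a ";" and which returned a 404. It assumes the Apache "combined" log format with each entry on a single line (the excerpts above are wrapped only for email). The regular expression and the function name `suspicious_requests` are my own illustration, not anything taken from Nutch or Apache.

```python
import re

# Assumed Apache "combined" log format, one entry per line:
# host ident user [time] "METHOD path PROTO" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def suspicious_requests(lines, agent_substring="Nutch"):
    """Return (time, path) pairs for 404 requests from the given agent
    whose path contains a ';' (hypothetical helper, for illustration)."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that are not in the assumed format
        if (agent_substring in m["agent"]
                and m["status"] == "404"
                and ";" in m["path"]):
            hits.append((m["time"], m["path"]))
    return hits


# One of the entries quoted above, joined back onto a single line.
sample = [
    '143.248.92.150 - - [28/Jul/2011:18:34:35 -0700] "GET /index.php;u=8 '
    'HTTP/1.0" 404 2786 "-" "Nutch/Nutch-1.0 (academic purpose; '
    'cats.kaist.ac.kr; [email protected])"',
]

for when, path in suspicious_requests(sample):
    print(when, path)
```

To run it against a real log, pass the open file object in place of `sample`. Note that this is only a rough filter: a site that legitimately uses ";" in its URLs would need a narrower test on the path.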

Thank you for your help with this matter.

A.
