Just in case you don't get it (or anybody else wants it as well) I've also
posted it here:
ftp://ftp.depressedpress.com/UserAgents.zip
I haven't gone through the new entries in a long while. There were
three thousand more rows (there are around 6.5 thousand now). I've just gone
through them and marked the obvious ones. Some notes, however:
1) Many of the agents are obviously bogus (there are many that are just
random strings of characters). There's no way to tell if these are bots or
not.
2) Many of the strings are programmatic interfaces ("CURL", "COLDFUSION",
etc.) - there's really no way to tell whether these are homemade bots or
homemade browsers.
Hope this helps,
Jim Davis
_____
From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 2:35 PM
To: CF-Talk
Subject: RE: user agent checking and spidering...
Jim,
Thanks - that might be a good place to start. Can you send it to my email?
thanks!
-Mark
-----Original Message-----
From: Jim Davis [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 1:27 PM
To: CF-Talk
Subject: RE: user agent checking and spidering...
I'm not sure if it's the best way to do things but I may be able to help
with the user agents. Basically what I've done is capture all the user
agents to hit my sites over the past few years. I go through periodically
and (using a bit column in the table) mark whether the agents are bots or
not.
I'm not saying it's 100% accurate (or complete), but what is? I can send
you the data if you like (let me know how you'd like it). It's sizable -
there are many thousands of rows.
I use the table to determine which sessions in the application are
generated by bots and to prevent those sessions from being stored in my
metrics application (it reduces clutter significantly).
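The approach described above - a table of observed user-agent strings with a manually maintained bot flag, consulted before a session is recorded - can be sketched roughly like this (the table name, column names, and sample agents are illustrative, not from the actual database):

```python
import sqlite3

# Illustrative schema: one row per observed user-agent string,
# plus a manually maintained bot flag (the "bit column" described above).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_agents (agent TEXT PRIMARY KEY, is_bot INTEGER)")
conn.executemany(
    "INSERT INTO user_agents (agent, is_bot) VALUES (?, ?)",
    [
        ("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", 1),
        ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", 0),
    ],
)

def is_bot(agent):
    """True if the agent has been flagged as a bot; unknown agents default to False."""
    row = conn.execute(
        "SELECT is_bot FROM user_agents WHERE agent = ?", (agent,)
    ).fetchone()
    return bool(row and row[0])

def record_session(agent, metrics):
    """Store a session in the metrics store only when the agent isn't a known bot."""
    if not is_bot(agent):
        metrics.append(agent)
```

Unknown agents default to non-bot here, which is why the periodic review pass described above matters - new agents have to be flagged by hand as they show up.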
If you want more accuracy/completeness you may also consider checking out
BrowserHawk (I forget the company name) - it's a user-agent parsing
component that works from a regularly updated database of agent
information. It'll give you much more than just "isBot", but it will also
cost you.
Let me know if you want my database.
Jim Davis
_____
From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 2:06 PM
To: CF-Talk
Subject: user agent checking and spidering...
CF talkers,
I have a client with many, many similar sites on a single server running
CFMX. Each of the sites is part of a "network" of sites that all link
together - about 150 to 200 sites in all. Each home page has links to
other sites in the network.
Periodically, it appears that Google or a similar search engine hits a
home page and spiders the links - which of course leads it to other sites
on the server and other links. This generates (again, this is my
hypothesis from examining the logs and behaviour) concurrent requests for
similar pages that all hit the same "news" database (in Access).
SequeLink (the Access service for JRun, I think) locks up quickly trying
to service hundreds of requests at once to the same Access file. This
results in a request queue that climbs into the thousands and requires a
restart of the CFMX services.
To fix this issue I am migrating the databases over to SQL Server, which
will help greatly with stability, but this will take a little time, and
there is still the problem of trying to avoid letting a spider hit this
single server with so many requests at once. Each site has a pretty
well-thought-out robots.txt file, but it doesn't help because the links
in question are to external sites - not pages on THIS site (even though
these external sites are virtuals on the same server).
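For reference, a robots.txt file only governs crawling of paths on the host that serves it - it can't stop a spider from following absolute links over to other hosts, which is exactly the problem described above. Some (though not all) crawlers also honor a non-standard Crawl-delay directive, which can ease concurrent load on well-behaved spiders. An illustrative per-site file (paths made up) might look like:

```
User-agent: *
Disallow: /admin/
Crawl-delay: 10
```

Since Crawl-delay is not part of the original robots exclusion standard, it shouldn't be relied on as the only throttle.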
I'm considering suggesting a "mask" be installed for spider agents that
eliminates the absolute links and only exposes the "internal" links -
which are controlled by the robots.txt. I'd like to know:
A) whether, in anyone's experience, my hypothesis may be correct;
B) whether there is anything I should watch out for in masking these
links; and
C) whether anyone knows of a link that gives the string values of the
various user-agents I'm trying to look for.
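The masking idea above - serving only the robots.txt-controlled internal links when the requester looks like a spider, and the full cross-site link set otherwise - might be sketched like this (the spider signatures and link lists are hypothetical, not a vetted list):

```python
# Hypothetical substrings identifying common spiders of the era.
SPIDER_SIGNATURES = ("googlebot", "slurp", "msnbot", "teoma")

# Links on THIS site (controlled by its robots.txt) versus absolute
# links into other sites in the "network" (illustrative URLs).
INTERNAL_LINKS = ["/news.cfm", "/about.cfm"]
EXTERNAL_LINKS = ["http://site2.example.com/", "http://site3.example.com/"]

def looks_like_spider(user_agent):
    """Case-insensitive substring match against the known spider signatures."""
    ua = user_agent.lower()
    return any(sig in ua for sig in SPIDER_SIGNATURES)

def links_for(user_agent):
    """Spiders see only the internal links; everyone else sees the full set."""
    if looks_like_spider(user_agent):
        return INTERNAL_LINKS
    return INTERNAL_LINKS + EXTERNAL_LINKS
```

One caveat with any approach like this: serving different links to spiders than to users is a form of cloaking, so the masked page should stay substantively identical to the normal one.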
Any help will be appreciated - thanks!
-Mark
Mark A. Kruger, MCSE, CFG
www.cfwebtools.com
www.necfug.com
http://blog.mxconsulting.com
...what the web can be!
_____

