I've emailed it to you.

Just in case you don't get it (or anybody else wants it as well) I've also
posted it here:

ftp://ftp.depressedpress.com/UserAgents.zip

I haven't gone through the new entries in a long while.  There were
three thousand more rows (there are around 6,500 now).  I've just gone
through them and marked the obvious ones.  Some notes, however:

1) Many of the agents are obviously bogus (there are many that are just
random strings of characters).  There's no way to tell if these are bots or
not.

2) Many of the strings are programmatic interfaces ("CURL", "COLDFUSION",
etc) - there's really no way to tell if these are homemade bots or homemade
browsers.
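One quick way to flag agents like these is a simple substring check. A minimal sketch in Python; the marker list below is illustrative, not a real bot database:

```python
# Flag user agents that look like programmatic interfaces by substring
# match.  The marker list is invented for illustration - a real table
# would be much longer and still wouldn't settle bot vs. homemade browser.
PROGRAMMATIC_MARKERS = ("curl", "coldfusion", "libwww", "wget")

def looks_programmatic(user_agent):
    """Return True if the agent string contains a known tool marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in PROGRAMMATIC_MARKERS)

print(looks_programmatic("curl/7.10.3"))                         # True
print(looks_programmatic("Mozilla/4.0 (compatible; MSIE 6.0)"))  # False
```

As the note above says, a match only tells you the request came from a tool, not whether that tool is a bot.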

Hope this helps,

Jim Davis

  _____  

From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 2:35 PM
To: CF-Talk
Subject: RE: user agent checking and spidering...

Jim,

Thanks - that might be a good place to start. Can you send it to my email?
thanks!

-Mark

  -----Original Message-----
  From: Jim Davis [mailto:[EMAIL PROTECTED]
  Sent: Sunday, April 04, 2004 1:27 PM
  To: CF-Talk
  Subject: RE: user agent checking and spidering...

  I'm not sure if it's the best way to do things, but I may be able to help
  with the user agents.  Basically what I've done is capture all the user
  agents that have hit my sites over the past few years.  I go through
  periodically and (using a bit column in the table) mark whether the
  agents are bots or not.

  I'm not saying it's 100% accurate (or complete), but what is?  I can send
  you the data if you like (let me know how you'd like it).  It's sizable:
  there are many thousands of rows.

  I use the table to determine which sessions on the application are
  generated by bots and prevent those sessions from being stored in my
  metrics application (this reduces clutter significantly).
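The table-lookup approach described above can be sketched in a few lines of Python. The table contents and names here are hypothetical stand-ins for the real database:

```python
# Hypothetical sketch of the approach above: a lookup keyed by user-agent
# string with a bot flag (the bit column), used to keep bot sessions out
# of a metrics store.  Both entries are invented for illustration.
agent_table = {
    "Googlebot/2.1 (+http://www.googlebot.com/bot.html)": True,
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)": False,
}

def record_session(user_agent, sessions):
    """Store the session only if its agent is not flagged as a bot."""
    if agent_table.get(user_agent, False):
        return  # known bot: keep it out of the metrics
    sessions.append(user_agent)

sessions = []
record_session("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", sessions)
record_session("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", sessions)
print(len(sessions))  # 1
```

Agents not yet in the table default to "not a bot" here; the periodic review pass described above is what keeps the flags current.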

  If you want more accuracy/completeness you might also consider checking
  out BrowserHawk (I forget the company name) - it's a user-agent parsing
  component that works from a regularly updated database of agent
  information.  It'll give you much more than just "isBot", but it will
  also cost you.

  Let me know if you want my database.

  Jim Davis

    _____  

  From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
  Sent: Sunday, April 04, 2004 2:06 PM
  To: CF-Talk
  Subject: user agent checking and spidering...

  Cf talkers,

  I have a client with many, many similar sites on a single server running
  CFMX.  Each of the sites is part of a "network" of sites that all link
  together - about 150 to 200 sites in all.  Each home page has links to
  other sites in the network.  Periodically, it appears that Google or a
  similar search engine hits a home page and spiders the links - which of
  course leads it to other sites on the server and other links.  This
  generates (again - this is my hypothesis from examining the logs and
  behaviour) concurrent requests for similar pages that all hit the same
  "news" database (in Access).  SequeLink (the Access service for JRun, I
  think) locks up quickly trying to service hundreds of requests at once
  to the same Access file.  This results in a request queue that climbs
  into the thousands and requires a restart of the CFMX services.

  To fix this issue I am migrating the databases over to SQL Server, which
  will help greatly with stability, but this will take a little time, and
  there is still the problem of trying to keep a spider from hitting this
  single server with so many requests at once.  Each site has a pretty
  well-thought-out robots.txt file, but it doesn't help because the links
  in question are to external sites - not pages on THIS site (even though
  these external sites are virtuals on the same server).

  I'm considering suggesting that a "mask" be installed for spider agents
  that eliminates the absolute links and only exposes the "internal"
  links - which are controlled by robots.txt.  I'd like to know:

  A) whether, in anyone's experience, my hypothesis may be correct;

  B) whether there is anything I should watch out for in masking these
  links; and

  C) whether anyone knows of a link that gives the string values of the
  various user-agents I'm trying to look for.
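The "mask" idea above - dropping the absolute cross-site links when the requester looks like a spider, so only robots.txt-governed internal links remain - could be sketched like this. The spider markers and link list are illustrative assumptions, not a tested filter:

```python
# Hedged sketch of the link "mask": when the requesting agent looks like
# a spider, drop absolute (cross-site) links and keep only internal ones,
# which robots.txt can govern.  Marker strings are illustrative only.
SPIDER_MARKERS = ("googlebot", "slurp", "msnbot")

def is_spider(user_agent):
    """Crude check: does the agent string contain a known spider marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in SPIDER_MARKERS)

def visible_links(links, user_agent):
    """For spiders, filter out absolute links; browsers see everything."""
    if not is_spider(user_agent):
        return links
    return [link for link in links if not link.lower().startswith("http")]

links = ["/news.cfm", "http://www.example-site2.com/"]
print(visible_links(links, "Googlebot/2.1"))  # ['/news.cfm']
print(visible_links(links, "Mozilla/4.0"))    # both links, unchanged
```

One caveat with any such mask: serving different links to spiders than to users is a form of cloaking, which some search engines penalize, so the list of agents to mask would need to be chosen carefully.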

  Any help will be appreciated - thanks!

  -Mark

  Mark A. Kruger, MCSE, CFG
  www.cfwebtools.com
  www.necfug.com
  http://blog.mxconsulting.com
  ...what the web can be!

    _____

  _____