Jim,

Thanks - that might be a good place to start. Can you send it to my email? Thanks!

-Mark

  -----Original Message-----
  From: Jim Davis [mailto:[EMAIL PROTECTED]
  Sent: Sunday, April 04, 2004 1:27 PM
  To: CF-Talk
  Subject: RE: user agent checking and spidering...

  I'm not sure if it's the best way to do things, but I may be able to help
  with the user agents.  Basically, what I've done is capture all the user
  agents that have hit my sites over the past few years.  I go through
  periodically and (using a bit column in the table) mark whether each agent
  is a bot or not.
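
  Roughly, the per-request check looks something like this (the table, column,
  and datasource names here are just placeholders for whatever you use):

  <!--- Look up the current user agent in the table of known agents;
        isBot is the bit column that marks spiders/bots. --->
  <cfquery name="qAgent" datasource="metrics">
    SELECT isBot
    FROM   userAgents
    WHERE  agentString = <cfqueryparam value="#CGI.HTTP_USER_AGENT#"
                                       cfsqltype="cf_sql_varchar">
  </cfquery>

  <cfset isKnownBot = false>
  <cfif qAgent.recordCount>
    <cfset isKnownBot = (qAgent.isBot EQ 1)>
  </cfif>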

  I'm not saying it's 100% accurate (or complete), but whatever is?  I can send
  you the data if you like (let me know how you'd like it).  It's sizable:
  there are many thousands of rows.

  I use the table to determine which sessions on the application are generated
  by bots and prevent those sessions from being stored in my metrics
  application (it reduces clutter significantly).
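
  For the metrics side it's basically just a guard around the insert -
  something like this (the sessionLog table and its columns are made up for
  the example):

  <cfif NOT isKnownBot>
    <!--- Only non-bot sessions get recorded in the metrics tables --->
    <cfquery datasource="metrics">
      INSERT INTO sessionLog (sessionID, userAgent, firstHit)
      VALUES (
        <cfqueryparam value="#session.sessionID#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#CGI.HTTP_USER_AGENT#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#now()#" cfsqltype="cf_sql_timestamp">
      )
    </cfquery>
  </cfif>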

  If you want more accuracy/completeness, you may also consider checking out
  BrowserHawk (I forget the company name) - it's a user-agent parsing component
  that works from a regularly updated database of agent information.  It'll
  give you much more than just "isBot", but it will also cost you.

  Let me know if you want my database.

  Jim Davis

    _____  

  From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
  Sent: Sunday, April 04, 2004 2:06 PM
  To: CF-Talk
  Subject: user agent checking and spidering...

  CF talkers,

  I have a client with many, many similar sites on a single server running
  CFMX.  Each of the sites is part of a "network" of sites that all link
  together - about 150 to 200 sites in all.  Each home page has links to other
  sites in the network.  Periodically, it appears that Google or a similar
  search engine hits a home page and spiders the links - which of course leads
  it to other sites on the server and still more links.  This generates
  (again, this is my hypothesis from examining the logs and behaviour)
  concurrent requests for similar pages that all hit the same "news" database
  (in Access).  SequeLink (the Access service for JRun, I think) locks up
  quickly trying to service hundreds of requests at once against the same
  Access file.  The result is a request queue that climbs into the thousands
  and requires a restart of the CFMX services.

  To fix this issue I am migrating the databases over to SQL Server, which
  will help greatly with stability, but this will take a little time, and
  there is still the problem of trying to keep a spider from hitting this
  single server with so many requests at once.  Each site has a pretty
  well-thought-out robots.txt file, but it doesn't help because the links in
  question are to external sites - not pages on THIS site (even though these
  external sites are virtual hosts on the same server).

  I'm considering suggesting that a "mask" be installed for spider agents
  that hides the absolute links and only exposes the "internal" links - which
  are controlled by the robots.txt (there's a rough sketch of what I mean
  after the questions below).  I'd like to know:

  A) In anyone's experience, is my hypothesis likely to be correct?

  B) Is there anything I should watch out for in masking these links?

  C) Does anyone know of a link that gives the string values of the various
  user-agents I'm trying to look for?
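
  To be concrete, the kind of thing I have in mind is below - the bot pattern
  is just a guess at a few common agent substrings, and the links are
  placeholders:

  <!--- Crude spider check: match a few common bot substrings in the
        user agent (or use a lookup table of known agents instead). --->
  <cfset botPattern = "bot|crawl|spider|slurp">
  <cfset isSpider = (REFindNoCase(botPattern, CGI.HTTP_USER_AGENT) GT 0)>

  <cfif NOT isSpider>
    <!--- Normal browsers see the absolute links to other network sites --->
    <a href="http://www.some-other-network-site.com/">Another network site</a>
  <cfelse>
    <!--- Spiders only see internal links, which robots.txt does control --->
    <a href="/news/index.cfm">News</a>
  </cfif>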

  Any help will be appreciated - thanks!

  -Mark

  Mark A. Kruger, MCSE, CFG
  www.cfwebtools.com
  www.necfug.com
  http://blog.mxconsulting.com
  ...what the web can be!

    _____