Phil White ([EMAIL PROTECTED]):

> Been playing with analog for a little while now, and last night
> decided to take a look at the ancrawl spider.

> Assuming it was taking a gentle perusal of my site, I left it to run 
> overnight. Ha! This afternoon, as it was STILL running, I killed it, and 
> decided to actually read the code.

By default, AnalogCrawler pauses 5 seconds between requests to the
server. If you have a 15,000-page site, it could easily take 20
hours to crawl it.
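
To make the arithmetic explicit, here is a minimal sketch in Python
(not part of AnalogCrawler; the function name is made up for
illustration). It counts only the pause between requests and ignores
download time, so it is a lower bound:

```python
def crawl_hours(pages: int, delay_seconds: float = 5.0) -> float:
    """Lower bound on crawl time in hours.

    Counts only the pause between requests; download and processing
    time would come on top of this.
    """
    return pages * delay_seconds / 3600


# 15,000 pages at the default 5-second pause
print(f"{crawl_hours(15_000):.1f} hours")  # 20.8 hours
```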


> Not surprising it had got lost. There is no support for either the 
> /robots.txt file, or the robots meta tag.

AnalogCrawler does support /robots.txt and should follow the rules of
the Standard for Robot Exclusion. Technically, AnalogCrawler uses the
LWP::RobotUA Perl module, which should handle the parsing and
enforcement of the rules in your /robots.txt file.
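
As an illustration of how such rules are evaluated, here is a small
sketch using Python's standard urllib.robotparser module. This is not
LWP::RobotUA and not AnalogCrawler's own code. The paths, host name,
and user-agent string are hypothetical, chosen to show how a dynamic
section could be fenced off:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all robots out of two dynamic areas.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /calendar/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot asks before each request.
allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://www.example.com/calendar/2001/05/")
print(allowed, blocked)
```

A robot that honors the exclusion standard would skip every URL under
the disallowed prefixes, which is one way to keep it out of an
endless dynamic section.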


> It had got lost in an endless loop of dynamically generated pages!

This is a stickier problem. If your site dynamically generates
pages that can produce new, unique URLs, then the crawler (or any
robot, for that matter) can't really know where to stop. Your
visitors could presumably hit any of those pages as well, so you'd
want the title information from AnalogCrawler to correlate with the
request information in your log files.

The best solution is to rework the dynamic section of your site to
generate the same URL for any page that produces the same content.
AnalogCrawler keeps track of the pages it has crawled, so it will
not get lost in link loops. There must be a finite amount of
information that your site can generate, and therefore there should be
an end to the dynamic pages at some point.
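
The general idea can be sketched in Python as follows. This is not
AnalogCrawler's code, only an illustration of visited-page tracking.
It assumes a hypothetical "sid" query parameter that changes the URL
without changing the content, and get_links stands in for whatever
fetches a page and extracts its links:

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that vary per visit but do not change the page.
IGNORED_PARAMS = {"sid"}


def canonical(url: str) -> str:
    """Reduce a URL to one form per distinct page."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def crawl(start: str, get_links, max_pages: int = 1000) -> list:
    """Breadth-first crawl that never visits the same canonical URL twice.

    max_pages is a safety cap for sites that really do generate
    unbounded distinct URLs.
    """
    first = canonical(start)
    seen = {first}
    queue = deque([first])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            key = canonical(link)
            if key not in seen:
                seen.add(key)
                queue.append(key)
    return visited
```

With this kind of tracking a loop of links ends once every distinct
page has been seen. If each request yields a URL that the crawler
cannot recognize as a duplicate, only a cap such as max_pages stops
it, which is why generating one URL per page of content matters.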


> Question: Has anyone addressed this, or am I going to have to start from 
> scratch?

If you are certain that the crawler is not behaving properly, can you
contact me off the list with more information about your site, so I
can work out any bugs that may be in the software?

Thanks,

--

Jeremy Wadsack
Wadsack-Allen Digital Group

+------------------------------------------------------------------------
|  This is the analog-help mailing list. To unsubscribe from this
|  mailing list, go to
|    http://lists.isite.net/listgate/analog-help/unsubscribe.html
|
|  List archives are available at
|    http://www.mail-archive.com/[email protected]/
|    http://lists.isite.net/listgate/analog-help/archives/
|    http://www.tallylist.com/archives/index.cfm/mlist.7
+------------------------------------------------------------------------
