Summary: A CF limitation in building a spider?

Phill Gibson Tue, 05 Dec 2000 14:55:22 -0800
Hi Everyone,

Here is a summary of responses to "A CF limitation in building a spider?"
that I posted a couple of weeks ago. I know a few folks had wanted to see
the results; maybe others will as well. Apologies to any I may have missed.

What will I do now? Use the existing CF spider for very light-duty stuff
(<=25 pages, say), then come along soon and rewrite it using Java servlets
where advantageous. I will probably also eventually put the CFHTTP one to
work with <cfschedule>.


Phill Gibson
Velawebs Web Designs
www.Velawebs.com
[EMAIL PROTECTED]


%%%%%%%%%%%%%%%%%%%%%%%%%
Original Problem:
I'm putting together a spider to be used with a search engine, and have come
up to what looks like a limitation in using CF for the task. Here's what's
going on:

I'm using cfhttp to extract links from specific sites. It loops through,
collecting all hyperlinks that are within the site and adds them to a list
to be visited. In visiting/indexing these links, however, it is still
looping through only in one CF page, hence it times out (set to about ten
minutes on our server). This won't do if I'm indexing a site of several
hundred pages.
Does anyone know of a better way to do the recursive call to cfhttp? Any way
I see it, you are still calling one page, and it eventually times out.
Thanks for any ideas!

Phill Gibson
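
For illustration, the link-collection step described above can be sketched
outside CF. This is a minimal Python sketch, not the code from the original
spider; the names LinkCollector and same_site_links are hypothetical, and it
assumes "within the site" means "same host as the page being parsed".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if (parts.scheme in ("http", "https")
                and parts.netloc == host
                and absolute not in found):
            found.append(absolute)
    return found
```

Given a page URL and its HTML, same_site_links returns the list of URLs to
add to the "to be visited" list; off-site and mailto: links are skipped.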

%%%%%%%%%%%%%%%%%%%%%%%%%
1st Response:
From: "Joseph Thompson" <[EMAIL PROTECTED]>

I built an FTP "recursive" custom tag that may help. I have my development
server set to not time out, but because the tag calls itself over and over,
I think it may work for you.
http://cfhub.com/taggallery/ftp2tree/ftp2tree.zip

%%%%%%%%%%%%%%%%%%%%%%%%%
2nd Response:
From: Rob Keniger <[EMAIL PROTECTED]>

Can you use <cfschedule> and set the timeout to something big? That way you
don't need to call it from the browser and you control the timeout for that
page only.

Rob Keniger
big bang solutions

%%%%%%%%%%%%%%%%%%%%%%%%%
3rd Response:
From: "Paul Mone" <[EMAIL PROTECTED]>

I don't think that CF is the right tool for this task.  In my eyes, this
falls under the category of 'offline processing', meaning that you're not
giving a user instant feedback or up to the moment data according to the
user's input.  Instead, you're assembling data that will be indexed and used
by some separate process that interacts directly with the user.

My experience with CFHTTP has led me to use it sparingly.  When it was being
called frequently (as well as recursively) I found it to be a real
server-hog, and occasionally it made CFServer crash.  This may have been
resolved with the latest releases of 4.5.1.

This task is going to be doing a lot of recursive calls and a lot of file
parsing.  If possible, you might want to do something like this in Java.  I
think you will get much better performance.

%%%%%%%%%%%%%%%%%%%%%%%%%
4th Response:
From: "Paul Johnston" <[EMAIL PROTECTED]>

I have to agree completely.  Using CF as a spider is really not very
sensible.  If I was suggesting a way to do this, I would use <cfexecute>
with some program that can dump results to a text file which CF can then
use.

There is no point in using up CF's resources on something it's not really
designed for.  Until we get custom functions (which we all hope will be in
CF 5) then we can't really do recursion very easily.  This is because
calling a custom tag (even though it's 'cached' by the CF server) is
effectively calling a file and if it's more than one file in size, it is an
absolute nightmare scenario as far as load on the file system goes. As for
using it as a spider, I seriously wouldn't.

My suggestion:

Get a C++/Perl/Java spider from somewhere (there are lots out there) that
you can set running that dumps all the data you need (maybe in a WDDX
packet) to a text file.

Paul

PS Did a search on Google for "web spider" and got this immediately:
http://www.tardis.ed.ac.uk/~skx/java/Spider/ . It may be awful, but it goes
to show there are lots out there.

%%%%%%%%%%%%%%%%%%%%%%%%%
5th Response:
From: Dave Watts <[EMAIL PROTECTED]>
Cc: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>

While I'd agree that CF isn't the ideal development tool for spiders, I'd
like to point out that recursion in CF doesn't really affect the load on the
filesystem of the CF server. The CF server reads the script file once,
builds an instruction set, and caches that instruction set in memory. Once
cached, CF doesn't need to read the file from the filesystem - it need only
read the attributes of the file. If you enable trusted cache - which you
should certainly do on production servers - it doesn't even read the file
attributes, although the web service itself may do that on the initial page
request.

CF recursion, and the use of custom tags in general, doesn't perform as well
as I'd like, due to each iteration of the script happening within a separate
memory context, but the filesystem itself won't be a problem.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/

%%%%%%%%%%%%%%%%%%%%%%%%%
6th Response:
From: "Philip Arnold - ASP" <[EMAIL PROTECTED]>

Roll on CF5, hopefully they're solving a lot of these problems... or maybe
CF6 for speed purposes

Philip Arnold
ASP Multimedia Limited

%%%%%%%%%%%%%%%%%%%%%%%%%
7th Response:
From: "Steve Bernard" <[EMAIL PROTECTED]>

You're probably better off using something like Linkbot. Dump the results to
a db and use CF to create reports or whatever. This is also much more cost
effective. You won't be able to completely mimic Linkbot's features and yet
you'll spend many more man hours than an enterprise license would cost.

Steve

%%%%%%%%%%%%%%%%%%%%%%%%%
8th Response:
From: James Sleeman <[EMAIL PROTECTED]>

Others have pointed out that CF is not a great tool for the task, but I'll
point out a way that it could be done.  Using CFSCHEDULE you could
schedule URLs to be explored something like this...

    explore.cfm takes a URL to explore (http://my.server/explore.cfm?URL=...)
        grabs the URL (CFHTTP)
        indexes the page
        pulls out the URLs found in the page
        schedules explore.cfm to run with each URL collected (CFSCHEDULE)
        finished

Not only is it not directly recursive and unlikely to time out (each request
only grabs and indexes one URL), but it would conceivably (depending on your
scheduling algorithm) explore URLs in parallel.  You would need to be
careful with the scheduling algorithm though so as not to bomb your server
with too many simultaneous scheduled tasks.

James Sleeman
[EMAIL PROTECTED] (home)
[EMAIL PROTECTED] (work)
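
The one-URL-per-request idea above can be sketched as a work queue. This is
a minimal Python sketch under stated assumptions, not James's code: the
names crawl_step and crawl are hypothetical, fetch_links stands in for
"grab, index, and pull out URLs", and the queue lives in memory here, whereas
the scheduled version would need to keep it somewhere persistent between
requests. The default cap of 25 pages echoes the light-duty limit mentioned
in the summary.

```python
from collections import deque


def crawl_step(queue, seen, fetch_links):
    """Process exactly one queued URL; return it, or None if nothing is left."""
    if not queue:
        return None
    url = queue.popleft()
    # fetch_links does the fetch, index, and link extraction for ONE page.
    for link in fetch_links(url):
        if link not in seen:
            seen.add(link)
            queue.append(link)
    return url


def crawl(start_url, fetch_links, max_pages=25):
    """Drive crawl_step until the queue empties or max_pages were visited."""
    queue, seen, visited = deque([start_url]), {start_url}, []
    while len(visited) < max_pages:
        url = crawl_step(queue, seen, fetch_links)
        if url is None:
            break
        visited.append(url)
    return visited
```

Because each step handles a single page, no single request has to outlive
the server timeout; the seen set keeps a page from being queued twice.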


%%%%%%%%%%%%%%%%%%%%%%%%%
9th Response:
From: "Michael Thomas" <[EMAIL PROTECTED]>

I've noticed one huge limitation that makes CF not a great choice in building
a spider. CFHTTP doesn't like the use of URL pointers to another domain, such
as http://www.thisdomain.com redirecting to http://www.thisotherdomain.com.
When testing whether a domain is valid, a URL pointer will be counted as
a *dead* link: CFHTTP fails because it didn't find a site, even if there
really is one. This problem has nothing to do with timing out.

Timing out is also an issue to be concerned with. It seems some sites just
won't comply with CFHTTP. To bring another matter to attention, have you ever
seen a page fetched with RESOLVEURL="yes" that came back with all the links
intact? I can't say I have.

Those are just a couple of issues I've had to deal with that I thought you
might want to consider beforehand. Good luck.

Sincerely,
Mike

%%%%%%%%%%%%%%%%%%%%%%%%%
10th Response:
From: Dave Watts <[EMAIL PROTECTED]>
Cc: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>

> I've noticed one huge limitation that makes CF not a great
> choice in building a spider. CFHTTP doesn't like the use of
-snip-

CFHTTP doesn't have a problem with HTTP redirects; it's up to you, as a
developer, to read the redirect from the HTTP response header and reexecute
CFHTTP if that's appropriate.

I can't think of any programming language which will automatically respond
to HTTP redirects.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
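
The "read the redirect from the response header and re-execute" approach can
be sketched as follows. This is a minimal Python sketch, not CFHTTP code:
redirect_target and resolve are hypothetical names, fetch is an assumed
callable returning (status, headers, body) for a URL, and the set of status
codes and the hop limit are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = {301, 302, 303, 307}


def redirect_target(url, status, headers):
    """Return the absolute URL a redirect response points to, or None."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    return urljoin(url, location) if location else None


def resolve(url, fetch, max_hops=5):
    """Re-request until a non-redirect response arrives or max_hops is hit."""
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        target = redirect_target(url, status, headers)
        if target is None:
            return url, status, body
        url = target
    raise RuntimeError("too many redirects")
```

With a check like this, a domain that only redirects elsewhere is followed
to its destination instead of being counted as a dead link, and the hop
limit guards against redirect loops.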
%%%%%%%%%%%%%%%%%%%%%%%%%
11th Response:
From: "Scott, Andrew" <[EMAIL PROTECTED]>

Hmmm, spider logic is always changing... However, if you got back a page
like that, you would already be running logic on it to find hrefs and the
like, so you can pull out other links to follow.

Why couldn't you just add another condition to check whether the page is
being refreshed by any of the known methods?


regards
Andrew Scott
Senior Cold Fusion Application Developer
ANZ eCommerce Centre


