Hey List-

I'm using Perl 5.8 and the most recent Mech (WWW::Mechanize) release.

I wrote a site spider (for an internal site) that follows all the HTML links 
and looks at each page for various pieces of information.

The problem I'm running into is links that point to large PDF or PPT files, 
which clog up the works. I'm trying to figure out how to download just the 
headers so I can determine whether the file is HTML (via $mech->is_html()) 
and, if it isn't, skip it.
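
Ideally the loop would end up looking something like this (rough sketch; 
the start URL is made up, and I'm assuming head() -- inherited from 
LWP::UserAgent -- doesn't upset Mech's page state):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use WWW::Mechanize;

  my $mech = WWW::Mechanize->new();

  # head() comes from LWP::UserAgent, so it should fetch headers only.
  # True if the Content-Type says the URL is an HTML page.
  sub looks_like_html {
      my ($url) = @_;
      my $res = $mech->head($url);
      return $res->is_success && $res->content_type eq 'text/html';
  }

  $mech->get('http://intranet.example.com/');    # made-up start page
  my @links = $mech->links();

  for my $link (@links) {
      my $url = $link->url_abs;
      next unless looks_like_html($url);    # skip PDFs, PPTs, etc.
      $mech->get($url);
      # ... look at the page here ...
  }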

It seems the $mech->get($url) method still downloads the whole file before I 
can look at the headers. The Mech docs say that get() is overloaded from 
LWP::UserAgent, and reading those docs, it has a size limiter. How can I 
limit the size of the download? I don't really care how, as long as I can 
just grab the headers.
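
I'm guessing that size limiter is max_size(), and since Mech subclasses 
LWP::UserAgent I was hoping something like this would do it (untested 
sketch, made-up URL):

  use strict;
  use warnings;
  use WWW::Mechanize;

  my $mech = WWW::Mechanize->new();

  # max_size() is an LWP::UserAgent attribute; anything bigger gets
  # truncated and the response grows a "Client-Aborted: max_size" header.
  $mech->max_size( 10_000 );

  my $url = 'http://intranet.example.com/some/link';    # made up
  $mech->get($url);

  if ( !$mech->is_html() ) {
      # PDF, PPT, whatever -- skip it
  }
  elsif ( defined $mech->response->header('Client-Aborted') ) {
      # It *is* HTML but got cut off at max_size, so re-fetch the
      # whole thing before pulling the links out of it.
      $mech->max_size( undef );
      $mech->get($url);
  }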

I was also thinking of just creating a new LWP::UserAgent object and using 
HTTP::Request to grab only the headers.
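
Something along these lines, I think (sketch, made-up URL):

  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTTP::Request;

  my $ua = LWP::UserAgent->new();

  # A HEAD request makes the server send back only the headers, no body.
  my $url = 'http://intranet.example.com/files/big.ppt';    # made up
  my $req = HTTP::Request->new( HEAD => $url );
  my $res = $ua->request($req);

  if ( $res->is_success && $res->content_type eq 'text/html' ) {
      # safe to hand this URL to $mech->get() and spider it
  }
  else {
      # PDF, PPT, or an error -- skip it
  }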

Anyone have any better ideas?

Thanks.

Henrik
-- 
Henrik Hudson
[EMAIL PROTECTED]

RTFM: Not just an acronym, it's the LAW!
