Thanks!

(I'm keeping this on the list, in case there is somebody else interested in the topic... hope it doesn't bother the others too much.)

So, my goal is this. I get a list of URLs from another script, and I would like to create an output file in the format:

CURR_URL_www.blah1.com

blah blah blah
blah blah blah

CURR_URL_www.blah2.com

blah blah blah

...

for all the URLs in the list (skipping anything that ends in .doc, .pdf, etc.)

I would like Latin-1 entities to be resolved, and the contents should be plain, unadorned text with no meta-information. (E.g., the lists of "links found in the page" that one gets with lynx -dump are extremely annoying for my purposes.)

I used to do this with LWP::Simple, HTML::Parse and HTML::FormatText.

However, I've seen that HTML::Parse is now deprecated, and my old script gave me some problems anyway, so I feel like it's time to update it.
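For reference, what I have in mind is roughly the untested sketch below. I'm assuming HTML::TreeBuilder is the right replacement for the deprecated HTML::Parse, and that the URL list arrives one per line on stdin (or in a file given as an argument):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TreeBuilder;
    use HTML::FormatText;

    # URLs arrive one per line on stdin or in a file passed as argument.
    while (my $url = <>) {
        chomp $url;
        next unless $url;
        next if $url =~ /\.(doc|pdf)$/i;     # skip .doc, .pdf, etc.

        my $html = get($url) or next;        # fetch the page, skip on failure

        # Parse and render as plain text; entities are decoded by the parser.
        my $tree = HTML::TreeBuilder->new_from_content($html);
        my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 72)
                                    ->format($tree);
        $tree->delete;

        print "CURR_URL_$url\n\n$text\n";
    }

That would give me the plain-text dump per URL with the CURR_URL_ header, but maybe there is a simpler way.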

Is there a way in which curl could help me? (I would rather write a short script with lots of imperfections than something really good that takes a week...)

Thanks again!

Marco


