(I'm keeping this on the list, in case there is somebody else interested in the topic... hope it doesn't bother the others too much.)
So, my goal is this: I get a list of URLs from another script, and I would like to create an output file in this format:
CURR_URL_www.blah1.com
blah blah blah blah blah blah
CURR_URL_www.blah2.com
blah blah blah
...
for all the URLs in the list (skipping things that end in .doc, .pdf, etc.).
I would like Latin-1 entities to be resolved, and the contents should be plain, unadorned text with no meta-information. (For example, the lists of "links found in the page" that one gets with lynx -dump are extremely annoying for my purposes.)
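Just to make the idea concrete, here is a rough sketch of the kind of thing I have in mind, in Python rather than my old Perl (the names page_text/dump_urls and the extension list are only placeholders, and I'm assuming the pages can be decoded as Latin-1):

```python
import sys
import urllib.request
from html.parser import HTMLParser

# Assumed list of extensions to skip; extend as needed.
SKIP_EXTENSIONS = (".doc", ".pdf", ".ps", ".zip")

class TextExtractor(HTMLParser):
    """Collect only the visible text of a page, ignoring tags and scripts."""
    def __init__(self):
        # convert_charrefs=True resolves entities like &eacute; to text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # inside <script>/<style>, ignore data

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def page_text(html_source):
    """Return the unadorned text of an HTML document, entities resolved."""
    parser = TextExtractor()
    parser.feed(html_source)
    # Collapse runs of whitespace left behind by the stripped markup.
    return " ".join(" ".join(parser.parts).split())

def dump_urls(urls, out=sys.stdout):
    """Write a CURR_URL_<url> header followed by the page text, per URL."""
    for url in urls:
        if url.lower().endswith(SKIP_EXTENSIONS):
            continue  # skip .doc, .pdf, etc.
        with urllib.request.urlopen(url) as resp:
            # Assumption: pages are Latin-1; replace anything that isn't.
            source = resp.read().decode("latin-1", errors="replace")
        out.write("CURR_URL_%s\n%s\n" % (url, page_text(source)))
```

That's roughly the level of "short script with imperfections" I'd be happy with, so if curl plus a few lines can do the same, even better.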
I used to do this with LWP::Simple, HTML::Parse and HTML::FormatText.
However, I've seen that HTML::Parse is now deprecated, and my old script gave me some problems anyway, so I feel like it's time to update it.
Is there a way in which curl could help me? (I would rather write a short script with lots of imperfections than something really good that takes a week...)
Thanks again!
Marco
