On Fri, 10 Nov 2006 03:11:42 -0500
Joey Hess <[EMAIL PROTECTED]> wrote:
> A. Costa wrote:
> > URLcat() { wget -o /dev/null --output-document=- "$1" | html2text
> > -ascii -nobs ; }
>
> Already available as lynx -dump url (formatting html), or w3m url
> (formats html also of course, and can be used in pipelines with no
> special switches) or GET url (from libwww-perl, raw html), or dog url
> (raw html and can also cat files), or probably half a dozen others I
> don't know of.
Hmm. Thanks for the info!
I'd known about 'lynx' (I use it in a script) but had somehow forgotten
it. 'lynx' inserts reference numbers next to links, which may not
always be desirable, plus it's biggish. I hadn't tried 'w3m' (also
big), nor 'GET', nor 'dog'.
I've now installed 'dog': it outputs raw HTML, as you say, but it's
much smaller than 'wget'.
Just noticed that 'html2text' itself can take URLs on the command line
(I'd thought it was just a pipe util). That would seem almost perfect,
but on testing, it doesn't work. For example, it chokes on this:
% html2text -ascii -nobs "http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=394813"
****** Not Found ******
The requested URL /cgi-bin/bugreport.cgi was not found on this server.
===============================================================================
Apache/1.3.33 Server at spohr.debian.org Port 80
...yet it works fine if the same page is fetched with 'dog' and piped
into 'html2text'. Well, that's another bug report.
Summing up: certainly not a new problem. The question is, what's the
best (smallest, fastest) method? So far I'm leaning towards 'dog' plus
'html2text', or just the latter if it gets fixed. Odd how none of
the names of these programs suggests the desired action. I couldn't
disagree if you closed this absent-minded attempt at reinventing the
wheel, but OTOH don't see why a more suggestive name hooked or wrapped
around whatever the least expensive method is would necessarily be a
bad thing.
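
To make the wrapper idea concrete, here is a rough sketch only. The name
'urlcat' and the order of preference are my own placeholders, not
anything that exists in a package, and it assumes the 'html2text'
options used earlier in this thread:

```shell
# Hypothetical wrapper: print a web page as plain text, using whichever
# fetcher happens to be installed. The preference order is an assumption.
urlcat() {
    url=$1
    if command -v dog >/dev/null 2>&1; then
        # 'dog' emits raw HTML, so pipe it through html2text
        dog "$url" | html2text -ascii -nobs
    elif command -v wget >/dev/null 2>&1; then
        # -q: quiet, -O -: write the document to stdout
        wget -q -O - "$url" | html2text -ascii -nobs
    elif command -v lynx >/dev/null 2>&1; then
        # lynx formats the HTML itself
        lynx -dump "$url"
    else
        echo "urlcat: no fetcher found" >&2
        return 1
    fi
}
```

Usage would be just: urlcat "http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=394813"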