--- Dan Bron <[EMAIL PROTECTED]> wrote:

> Oleg wrote:
> >  Would be nice to be able to extract such text output
> >  from dodot.htm. 
> 
> Ironic.  The one time I refrain from using  qdoj  is the one time someone
> would have found its use helpful.

Yes, it would be a smarter form of qdoj (or a wrapper around it) that
takes semantic parts (header, monad, dyad, body, or a combination
thereof) and figures out the text range.

> The problem is down-converting HTML to plain text.  It involves some
> non-trivial aesthetic and design decisions (i.e. what information do I
> want to use?).

The "text" script could be helpful (load 'text').

> I've considered extending  qdoj  to use as an HTML-to-plaintext converter.
> After all, that's what command line web browsers are designed to do.  The
> first question that arises is "how big should I tell w3m the console is?"
> (and the first answer that occurs is "the size of the IJX window").

This could be simplified with assumptions about the concrete
HTML and section formatting used in DoJ. There are some quirks
though: to represent a tightly built list, DoJ uses tables
instead of a standard UL/LI and CSS for layout.
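
To illustrate the table quirk, here is a minimal sketch (in Python rather
than J) that flattens table rows into plain-text lines. The class and
function names are hypothetical, the sample markup in the usage note is
made up rather than taken from DoJ, and it assumes every td/tr tag is
explicitly closed, which the real pages may not guarantee.

```python
from html.parser import HTMLParser


class TableToText(HTMLParser):
    """Collect the cell text of each <tr>; one list of cells per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cells = None  # cells of the row being read, None outside a row
        self.buf = None    # text pieces of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells = []
        elif tag in ("td", "th") and self.cells is not None:
            self.buf = []

    def handle_data(self, data):
        if self.buf is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.buf is not None:
            # Join the pieces and collapse runs of whitespace.
            self.cells.append(" ".join("".join(self.buf).split()))
            self.buf = None
        elif tag == "tr" and self.cells is not None:
            self.rows.append(self.cells)
            self.cells = None
            self.buf = None


def table_to_lines(html):
    """Return one plain-text line per table row, skipping empty cells."""
    parser = TableToText()
    parser.feed(html)
    parser.close()
    return ["  ".join(c for c in row if c) for row in parser.rows]
```

For example, table_to_lines('<table><tr><td>a</td><td>b c</td></tr></table>')
gives a one-element list with the two cells separated by two spaces.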

> PS:  Another option is to rewrite the Dictionary in some templatized,
> plain-text-friendly format which could be readily converted to HTML.
> MoinMoin markup springs to mind....

The existing HTML should be consistent and simple enough, 
so it could be treated as already templatized.

Might be possible to treat HTML as XML and use xml/sax or xml/xslt
to extract the text semantically.
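
A rough sketch of the XML idea, again in Python rather than with the J
xml addons. It assumes the page is well-formed XML without namespaces and
that each semantic part is a heading followed by sibling elements; both
are assumptions about the markup, not facts about dodot.htm. The function
name section_text is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4"}


def section_text(xhtml, title):
    """Return the text following the heading `title`, up to the next heading."""
    root = ET.fromstring(xhtml)
    body = root.find("body")
    if body is None:
        body = root  # fragment without a <body> wrapper
    out, inside = [], False
    for el in body:
        # All text inside the element, with whitespace collapsed.
        text = " ".join("".join(el.itertext()).split())
        if el.tag in HEADINGS:
            inside = (text == title)
        elif inside and text:
            out.append(text)
    return "\n".join(out)
```

Calling section_text(page, "Monad") would then return only the paragraphs
under a heading whose text is exactly "Monad", and an empty string if no
such heading exists.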



      
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
