> > Yes, I thought that would be the way to go as well down the line,
> > parsing the HTML with tagsoup [...]
>
> when I say script here I'm talking about a ruby script, partially
> because I saw this as a converter for the manual for a limited amount
> of time (not immediate throwaway, but not a piece of code that's going
> to fester in svn for years after), and partially because I know enough
> ruby to get it done (and I don't know enough perl/python).
>
> also, for light text processing I think Java is probably too
> heavyweight (a sledgehammer to crack a nut), and ruby has nice regexp
> capabilities.

Java does regex just fine, albeit more verbosely (when is Java not
verbose ;-), but my main point is that you already have (Java) tools
that give you an XML view of the existing HTML manual (tagsoup,
etc...). Leave the parsing to those tools, and concentrate on
transforming the "loose" HTML schema into more structured XML,
probably using XSL as the language rather than scripting. By adding a
little more structure to the HTML with <div>s, the XML view of the
HTML could become complete enough for a robust transformation to XML,
and perhaps even robust enough that the HTML remains the official
"source" document of the manual (stripped of all formatting, which
would be re-added later in the XML processing pipeline). The main
advantage would be that editing the manual in an HTML editor is
easier/nicer and kinda wysiwyg, compared to editing the transformed
XML.
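
Roughly the kind of wiring I have in mind, just as a sketch (the
stylesheet and file names are made up):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class ManualToXml {
    public static void main(String[] args) throws Exception {
        // TagSoup presents the loose HTML as a well-formed XML (SAX) stream.
        XMLReader htmlReader = new Parser();

        // The stylesheet holds all the project-specific mapping logic.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("manual-to-xml.xsl")));

        transformer.transform(
                new SAXSource(htmlReader, new InputSource("manual.html")),
                new StreamResult(new File("manual.xml")));
    }
}

No hand-written parsing anywhere: TagSoup worries about the tag soup,
and the HTML-to-XML mapping lives entirely in the stylesheet.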

> I'm aiming for a proof of concept script (for echo task) sometime in
> the next week (if work doesn't get in the way too much).  After that
> I'll see how easy a refactoring job will be for making it generic.

From the above, you can see that I envision the possibility that the
HTML manual remains, so it's all the more important that the transform
is robust.

> Right now I'm working with the source HTML as is (yes adding divs etc
> would help immensely) - to see how difficult it is.  I've got a fair
> amount done, but the tokenizer I'm using is being a little greedy and
> I've got the examples table as part of the parameters (oops).

Talk of the tokenizer being too greedy makes me uneasy ;-) Leave the
parsing to an existing parsing tool, and just manipulate the structure
of the document once it has been "reformatted" into a SAX event
stream. In that form, it feeds easily and naturally into an XSL
transform pipeline.
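
Again just a sketch (the two stylesheet names are hypothetical), but
chaining transforms over that SAX stream is straightforward with
plain JAXP:

import java.io.File;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxPipeline {
    public static void main(String[] args) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();

        // Stage 1: tighten up the loose HTML structure.
        TransformerHandler stage1 = factory.newTransformerHandler(
                new StreamSource(new File("normalize.xsl")));
        // Stage 2: map the normalized structure to the target XML.
        TransformerHandler stage2 = factory.newTransformerHandler(
                new StreamSource(new File("to-manual-xml.xsl")));

        // Wire stage 1 into stage 2, and stage 2 into the output file.
        stage1.setResult(new SAXResult(stage2));
        stage2.setResult(new StreamResult(new File("manual.xml")));

        // TagSoup drives the whole chain as one SAX event stream.
        XMLReader reader = new Parser();
        reader.setContentHandler(stage1);
        reader.parse(new InputSource("manual.html"));
    }
}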

That's my view of the whole thing anyway ;-) --DD
