> > Yes, I thought that would be the way to go as well down the line,
> > parsing the HTML with tagsoup [...]
>
> when I say script here I'm talking about a ruby script (partially as
> I saw this as being a converter for the manual for a limited amount
> of time, not immediate throwaway, but not a piece of code that's
> going to fester in svn for years after, partially because I know
> enough ruby to get it done in it (and I don't know enough
> perl/python))
>
> also for light text processing I think Java is probably too
> heavyweight (sledgehammer to crack a nut) and ruby has nice regexp
> capabilities.
Java does regex just fine, albeit more verbosely (when is Java not
verbose ;-), but my main point is that you already have (Java) tools
that give you an XML view of the existing HTML manual (tagsoup,
etc...). Leave the parsing to those tools, and concentrate on
transforming the "loose" HTML schema into a more structured XML,
probably using XSL as the language rather than scripting.

By adding a little more structure to the HTML with <div>s, the XML
view of the HTML could be complete enough for a robust transformation
to XML, and perhaps even robust enough that the HTML remains the
official "source" document of the manual (but stripped of all
formatting, which would be added later in the XML processing
pipeline). The main advantage of this would be that editing the
manual's HTML in an HTML editor can be easier/nicer and kinda wysiwyg,
compared to editing the transformed XML.

> I'm aiming for a proof of concept script (for echo task) sometime in
> the next week (if work doesn't get in the way too much). After that
> I'll see how easy a refactoring job will be for making it generic.

From the above, you can see that I envision the possibility of the
HTML manual remaining the source, so it's all the more important that
the transform is robust.

> Right now I'm working with the source HTML as is (yes adding divs etc
> would help immensely) - to see how difficult it is. I've got a fair
> amount done, but the tokenizer I'm using is being a little greedy and
> I've got the examples table as part of the parameters (oops).

Talk of the tokenizer being too greedy makes me uneasy ;-) Leave the
parsing to an existing parsing tool, and just manipulate the structure
of the document once it's been "reformatted" to a SAX event stream. In
this form, it feeds easily and naturally into an XSL transform
pipeline.

That's my view of the whole thing anyway ;-)

--DD
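PS: untested, but the whole pipeline I'm describing is only a dozen
lines of Java: TagSoup acts as the SAX parser, and the standard JAXP
transformer consumes its events directly (the file names echo.html and
manual-to-xml.xsl below are just placeholders):

import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class ManualToXml {
    public static void main(String[] args) throws Exception {
        // TagSoup is a SAX XMLReader that turns even sloppy HTML
        // into a well-formed stream of SAX events.
        XMLReader tagsoup = new Parser();

        // The "loose" HTML page is the transform's input; no
        // hand-rolled tokenizer, no intermediate cleaned-up file.
        Source html = new SAXSource(tagsoup, new InputSource("echo.html"));

        // manual-to-xml.xsl is a placeholder name for the stylesheet
        // that maps the HTML structure (divs, tables) to the target XML.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("manual-to-xml.xsl"));

        t.transform(html, new StreamResult(System.out));
    }
}

All the real work then lives in the stylesheet, which is the part
worth keeping robust; the Java driver never needs to change as the
manual's markup evolves.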