The application is automated translation of HTML pages. I need to
extract the text from a page, translate it, and then recreate the
original HTML, all on a headless server.

I started using Mozilla so that I could focus just on the extraction of
the text and ignore the hassles of HTML parsing. I was also using XPCOM,
NSPR, and the charset detection/conversion code extensively. Perhaps
that was using a sledgehammer for a tack hammer's job, but at the time
it seemed the best way to do it. These days we have got rid of nearly
everything we were using from Mozilla except for HTML parsing and
charset detection.

Robert Sayre wrote:
> Brodie Thiesfield wrote:
>> What I need to do is:
>
> Step 1 needs more detail.
>
>> * parse an HTML document (or fragment)
>
> Do you need scripts to execute? Should document.write work? etc.

No. I don't need any scripts, CSS, layout, etc. to be done. Just the raw
HTML to DOM conversion. I then traverse the DOM, translating the text,
and extract the final result afterwards.

>> * traverse and modify the DOM
>
> This should be possible.

Yes. I assume the current code can be used unmodified.

>> * rewrite the HTML document to source
>
> innerHTML or equivalent should do the trick once you have the DOM
> resulting from step 1.

We currently use the nsIDocumentEncoder interface. I assume innerHTML
provides a similar style of access.
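
To make the three steps concrete, here is a rough sketch of the
parse / modify text / re-serialize round trip. It is only an
illustration: it uses Python's standard html.parser as a stand-in for
the Mozilla parser, it is event-based rather than a real DOM, and the
names TextTranslatingRewriter and translate_html are hypothetical. No
scripts, CSS, or layout are involved, and charset handling is assumed
to have happened before the text reaches it.

```python
from html.parser import HTMLParser


class TextTranslatingRewriter(HTMLParser):
    """Re-emit HTML as parsed, passing text content through `translate`."""

    # Raw-text elements whose content must never be translated.
    SKIP = {"script", "style"}

    def __init__(self, translate):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back exactly as they appeared.
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source.
        self.out.append(self.get_starttag_text())
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so </P> comes back as </p>.
        self.out.append(f"</{tag}>")
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Leave script/style content and whitespace-only runs untouched.
        if self.skip_depth or not data.strip():
            self.out.append(data)
        else:
            self.out.append(self.translate(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def translate_html(source, translate):
    """Parse `source`, translate its text, and return the rebuilt HTML."""
    rewriter = TextTranslatingRewriter(translate)
    rewriter.feed(source)
    rewriter.close()
    return "".join(rewriter.out)
```

For example, translate_html(page, str.upper) uses str.upper as a
placeholder "translator"; a real translation function would be passed
in its place. Unlike a DOM-based approach, this sketch sees text in
fragments split at tags and entities, so a real translator would
likely need to collect whole sentences before translating.
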
Do you think that using the embedding interface is possible?
Regards,
Brodie
_______________________________________________
dev-embedding mailing list
https://lists.mozilla.org/listinfo/dev-embedding