The application is automated translation of HTML pages. I need to extract the text from a page, translate it, and then rebuild the original HTML around the translated text, all on a headless server.

I started using Mozilla so that I could focus on extracting the text and leave the hassles of HTML parsing to it. I was also using the XPCOM, NSPR, and charset detection/conversion code extensively. Perhaps it was a sledgehammer for a job that needed a tack hammer, but at the time it seemed the best way to do it. These days we have dropped nearly everything we were using from Mozilla except the HTML parsing and charset detection.

Robert Sayre wrote:
Brodie Thiesfield wrote:

What I need to do is:

Step 1 needs more detail.

  * parse an HTML document (or fragment)

Do you need scripts to execute? Should document.write work? etc.

No. I don't need any scripts, CSS, layout, etc. to be processed, just the raw HTML to DOM conversion. I then traverse the DOM, translating the text as I go, and extract the final result afterwards.

  * traverse and modify the DOM

This should be possible.

Yes. I assume the current code can be used unmodified for this.

  * rewrite the HTML document to source

innerHTML or equivalent should do the trick once you have the DOM resulting from step 1.

We currently use the nsIDocumentEncoder interface. I assume innerHTML provides a similar style of access.
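
To make the three steps concrete without tying them to Mozilla, here is a rough sketch in Python using only the standard library's html.parser. It is not our actual code and it does not use nsIDocumentEncoder. The names TreeBuilder, translate_tree, serialize and translate_html are made up for illustration, and the translate callback is a placeholder for the real translation engine. It also skips charset detection and the error recovery a browser parser performs on broken markup.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have content or an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content is passed through untranslated and unescaped.
RAW_TEXT_TAGS = {"script", "style"}


class Element:
    def __init__(self, tag, attrs):
        self.tag = tag        # None marks the synthetic root
        self.attrs = attrs    # list of (name, value-or-None) pairs
        self.children = []


class Text:
    def __init__(self, data):
        self.data = data      # unescaped, translatable text


class Raw:
    def __init__(self, data):
        self.data = data      # markup emitted exactly as stored


class TreeBuilder(HTMLParser):
    """Step 1: turn HTML source into a small tree. No scripts, CSS or layout."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Element(None, [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Element(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        kind = Raw if parent.tag in RAW_TEXT_TAGS else Text
        parent.children.append(kind(data))

    def handle_comment(self, data):
        self.stack[-1].children.append(Raw(f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.stack[-1].children.append(Raw(f"<!{decl}>"))


def translate_tree(node, translate):
    """Step 2: walk the tree and replace every non-blank text node."""
    for child in node.children:
        if isinstance(child, Text):
            if child.data.strip():
                child.data = translate(child.data)
        elif isinstance(child, Element):
            translate_tree(child, translate)


def serialize(node):
    """Step 3: write the tree back out as HTML source."""
    if isinstance(node, Text):
        return escape(node.data, quote=False)
    if isinstance(node, Raw):
        return node.data
    inner = "".join(serialize(child) for child in node.children)
    if node.tag is None:
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{escape(value)}"'
        for name, value in node.attrs
    )
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def translate_html(source, translate):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    translate_tree(builder.root, translate)
    return serialize(builder.root)
```

Calling translate_html(page_source, my_translate) with any str -> str function runs all three steps. The output is normalised rather than byte-identical to the input (tag and attribute names come back lowercased and attribute values are re-quoted), which is one reason a real parser and encoder pair is still attractive.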

Do you think that using the embedding interface is possible?

Regards,
Brodie
_______________________________________________
dev-embedding mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-embedding
