Re: parse and dom tree

Brodie Thiesfield Tue, 12 Dec 2006 10:45:18 -0800

The application is automated translation of HTML pages. I need to extract text from the page, translate it, and then recreate the original HTML. All on a headless server.

I started using Mozilla so that I could focus just on the extraction of the text and ignore the hassles of HTML parsing. I was also using the XPCOM, NSPR, and charset detection/conversion code extensively. Perhaps it is using a sledgehammer to accomplish the task of a tack hammer. At the time it seemed the best way to do it. These days we have got rid of nearly everything we were using from Mozilla except for HTML parsing and charset detection.


Robert Sayre wrote:

Brodie Thiesfield wrote:


What I need to do is:


Step 1 needs more detail.

  * parse an HTML document (or fragment)


Do you need scripts to execute? Should document.write work? etc.

No. I don't need any scripts, CSS, layout, etc. to be done. Just the raw HTML to DOM conversion. I then traverse the DOM, translating the text, and extract the final result afterwards.

  * traverse and modify the DOM


This should be possible.


Yes. I assume the current code should be able to be used unmodified.

  * rewrite the HTML document to source
innerHTML or equivalent should do the trick once you have the DOM resulting from step 1.

We currently use the nsIDocumentEncoder interface. I assume innerHTML provides a similar style of access.
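
To make the three steps concrete (parse, traverse and modify, write back to source), here is a minimal sketch in Python using only the standard library's html.parser. It is a stand-in for illustration and does not use any Mozilla interface (no XPCOM, no nsIDocumentEncoder). The names Node, TreeBuilder, translate_tree, serialize and translate_html are all hypothetical, the tree is far simpler than a real DOM, and the error recovery is much cruder than what a browser parser does.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Elements whose text is code, not prose: never translated or escaped.
RAW_TEXT = {"script", "style"}


class Node:
    """Hypothetical minimal tree node: tag=None marks a text node."""
    def __init__(self, tag=None, attrs=None, text=None, parent=None):
        self.tag, self.attrs, self.text = tag, attrs or [], text
        self.parent, self.children = parent, []


class TreeBuilder(HTMLParser):
    """Step 1: turn raw HTML into a tree. No scripts, CSS or layout."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node(tag="#root")
        self.cur = self.root

    def _add(self, node):
        node.parent = self.cur
        self.cur.children.append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._add(Node(tag=tag, attrs=attrs))
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        self._add(Node(text=data))

    def handle_comment(self, data):
        self._add(Node(tag="#raw", text=f"<!--{data}-->"))

    def handle_decl(self, decl):
        self._add(Node(tag="#raw", text=f"<!{decl}>"))


def translate_tree(node, translate):
    """Step 2: walk the tree and rewrite prose text nodes in place."""
    for child in node.children:
        if child.tag is None:
            if node.tag not in RAW_TEXT and child.text.strip():
                child.text = translate(child.text)
        else:
            translate_tree(child, translate)


def serialize(node):
    """Step 3: write the (modified) tree back out as HTML source."""
    if node.tag is None:
        raw = node.parent.tag in RAW_TEXT
        return node.text if raw else escape(node.text, quote=False)
    if node.tag == "#raw":
        return node.text
    inner = "".join(serialize(c) for c in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
        for k, v in node.attrs)
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def translate_html(source, translate):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    translate_tree(builder.root, translate)
    return serialize(builder.root)
```

Calling translate_html(page, my_translator) with any function from str to str returns the rewritten markup. The output is normalised rather than byte-identical to the input (attribute quoting and entity spelling can change), which is one reason a real parser plus a proper document encoder is the better tool for production use.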


Do you think that using the embedding interface is possible?

Regards,
Brodie
_______________________________________________
dev-embedding mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-embedding
