Boris Zbarsky wrote:
Erik Walter wrote:

Is there a way to do that through the WebPersist interface of the embedding stuff?

Good question... Last I checked, Adam (the nsIWebBrowserPersist owner) said that it's the caller's responsibility to pass in a pointer to a document which was loaded with script turned _off_ to avoid this problem. In other words, one would want to get the doc from the cache and reparse it with script turned off, then pass that in.

Basically the deal is you can:
  1. Save some random URI, e.g. http://www.cnn.com/. What comes back is dumped straight to disk to a filename of your choice. No fixup is done and the data is saved as is.
  2. Save a DOM document. What goes to disk is the DOM reconstituted into HTML via a document encoder, complete with URLs fixed up to their local file locations and, optionally, any inline elements such as images, subframes, stylesheets, etc.
The reason that pages might double up items is JavaScript. Some pages document.write stuff out when they're loaded, and the DOM given to the persist object already contains those written items. The next time you load the saved copy, the JS in the page document.write's them again.
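The duplication can be shown with a toy model (hypothetical Python, not Mozilla code): treat the DOM as a list of nodes, where "loading" executes document.write and appends its output, while the script node itself stays in the DOM. Serializing that live DOM and loading it again runs the script a second time:

```python
# Toy model (not Mozilla code) of why saving a live DOM doubles up
# script-written content on reload.

SCRIPT = 'document.write(...)'   # a script node that writes one element
DYNAMIC = '<p>dynamic</p>'       # the element it writes

def load(nodes):
    """Simulate loading: document.write appends a node to the DOM,
    and the script node itself remains in the DOM afterwards."""
    dom = list(nodes)
    for node in nodes:
        if node.startswith('document.write'):
            dom.append(DYNAMIC)
    return dom

def save(dom):
    """Simulate persisting: every node in the DOM, scripts included,
    is serialized back out verbatim."""
    return list(dom)

original = ['<p>static</p>', SCRIPT]
first = load(original)      # one dynamic element, written by the script
saved = save(first)         # saved copy contains both the script AND its output
second = load(saved)        # reload runs the script again...

assert first.count(DYNAMIC) == 1
assert second.count(DYNAMIC) == 2   # ...so the item is doubled up
```

The saved file is "correct" as a snapshot of the DOM; the problem is purely that the script that produced part of that snapshot gets re-executed on the next load.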

There is no easy way around this, though temporarily disabling JS might help in some cases.

Of course doing that is nontrivial....

I'm just doing a SaveDocument() right now; is there some additional encoding flag I need to set?

In terms of this interface, I was suggesting you do a saveURI.  But that will not save the images and CSS...

The basic problem is that the persistence object has no way to tell content written by script apart from content that was part of the original HTML, which is why this somehow needs to be the caller's responsibility.
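What "the caller's responsibility" means in practice can be sketched with a toy model (hypothetical Python, not the actual nsIWebBrowserPersist API): reparse the source with scripting disabled, so the DOM handed to the persist object contains no script-written nodes, and the script produces its output exactly once on reload:

```python
# Toy model (not Mozilla code) of the caller-side fix: reparse with
# scripting off before handing the document to the persist object.

SCRIPT = 'document.write'
DYNAMIC = '<p>dynamic</p>'   # what the script writes when it runs

def parse(nodes, scripts_enabled=True):
    """Build a DOM from source nodes. document.write output is appended
    only if scripting is enabled; the script node itself always stays."""
    dom = list(nodes)
    if scripts_enabled:
        for node in nodes:
            if node.startswith(SCRIPT):
                dom.append(DYNAMIC)
    return dom

source = ['<p>static</p>', SCRIPT + '(...)']

live_dom = parse(source)                          # what the user sees
clean_dom = parse(source, scripts_enabled=False)  # what should be saved

# Reloading the clean copy reproduces the live page, with no doubling:
assert parse(clean_dom) == live_dom
```

The catch, as noted above, is getting that clean reparse at all: it requires the cached source and a parser that doesn't drag in the rest of the layout machinery.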

-Boris
We need the unmolested DOM from somewhere before JS gets its hands on it, which means holding a copy around or somehow making one from the cached page data. The latter might be possible, but the DOM parsing code expects presshells, docshells, and other unpleasantness, which makes it totally impractical at the moment. We really need a simple parser that can be created on the fly and handed a stream, with a DOM popping out of the other end. It would be nice to know when a page is 'dirty' too, so that creating a fresh DOM would only be done when necessary.
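For illustration only, the desired shape of such a parser looks something like this sketch, which uses Python's stdlib html.parser as an analogy: instantiated on the fly, fed a stream, a tree comes out the other end, with no presentation machinery involved. (This is not Mozilla code; it just shows the stream-in/DOM-out contract being asked for.)

```python
# Sketch of a "create on the fly, feed a stream, get a DOM" parser,
# using Python's stdlib html.parser purely as an analogy.
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a simple nested-dict tree from an HTML stream."""
    def __init__(self):
        super().__init__()
        self.root = {'tag': '#document', 'children': []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {'tag': tag, 'attrs': dict(attrs), 'children': []}
        self.stack[-1]['children'].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]['children'].append({'tag': '#text', 'data': data})

builder = TreeBuilder()                                  # created on the fly
builder.feed('<html><body><p>hello</p></body></html>')   # stream goes in...
dom = builder.root                                       # ...DOM pops out
```

Note that the scripts in the stream are never executed here, which is exactly the property the persist code needs from a clean reparse.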

There is a parser already for the XMLHttpRequest stuff, but one for HTML content is required too. There is a bug open on this, but the problem of making a simple DOM parser has not been solved yet.

http://bugzilla.mozilla.org/show_bug.cgi?id=115328

Adam
