Boris Zbarsky wrote:
Erik Walter wrote:
Is there a way to do that through the
WebPersist interface of the embedding stuff?
Good question... Last I checked, Adam (nsIWebBrowserPersist owner) said
that it's the responsibility of the caller to pass in a pointer to a
document which was loaded with script turned _off_ to avoid this
problem... So in other words, one would want to get the doc from cache
and reparse it, with script turned off, then pass that in.
Basically the deal is you can:
- Save some random URI, e.g. http://www.cnn.com/. What comes back
is dumped straight to disk to a filename of your choice. No fixup is
done and the data is saved as is.
- Save a DOM document. What goes to disk is the DOM reconstituted
into HTML via a document encoder, complete with URLs fixed up for their
local file locations and, optionally, any inline content such as
images, subframes, stylesheets, etc.
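The difference between the two modes can be sketched outside of Gecko. This toy Python model (the function names and the `page_files/` directory convention are illustrative, not the nsIWebBrowserPersist API) shows raw bytes going straight to disk in the first mode versus URL fixup in the second:

```python
import re

def save_uri(raw_bytes, path, files):
    # Mode 1: dump the network response straight to disk, no fixup.
    files[path] = raw_bytes

def save_document(html, path, files):
    # Mode 2: re-serialize the "DOM" with remote URLs rewritten to
    # local file names, the way a document encoder would, and save
    # each referenced subresource alongside the page.
    def localize(match):
        url = match.group(1)
        local = url.rsplit("/", 1)[-1]
        files["page_files/" + local] = b"<fetched>"  # saved subresource
        return 'src="page_files/%s"' % local
    files[path] = re.sub(r'src="(https?://[^"]+)"', localize, html)

files = {}
save_uri(b"<html>...</html>", "raw.html", files)
save_document('<img src="http://example.com/logo.png">', "page.html", files)
```

After running this, `raw.html` holds the untouched bytes while `page.html` points its `src` at the locally saved copy.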
The reason that pages might double up items is because of JavaScript.
Some pages document.write stuff out when they're loaded, and the DOM
given to the persist object already contains those written items. The
next time you load the saved copy, the JS in the page document.writes
them again.
There is no easy way around this, though temporarily disabling JS might
help in some cases.
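The doubling can be demonstrated with a toy Python model of a page load, where document.write is modeled as appending markup to the DOM (an illustration only, not real browser behavior):

```python
import re

def load(html):
    # Toy page load: each document.write('X') in an inline script
    # appends X to the DOM; the script element itself stays in the DOM.
    dom = html
    for text in re.findall(r"document\.write\('([^']*)'\)", html):
        dom += text
    return dom

page = "<script>document.write('<p>ad</p>')</script>"
live_dom = load(page)        # the DOM handed to the persist object
saved_copy = live_dom        # serialized as-is, script still inside
reloaded = load(saved_copy)  # the JS writes the item a second time
```

Because the saved copy contains both the script and the markup it already wrote, every reload appends one more copy of the written item.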
Of course doing that is nontrivial....
I'm just doing a SaveDocument() right now; is
there some additional encoding flag I need to set?
In terms of this interface, I was suggesting you do a saveURI. But
that will not save the images and CSS...
The basic problem is that the persistence object has no way to tell
apart content written by script and content that is part of the
original HTML, which is why this somehow needs to be the caller's
responsibility.
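One way to see why the saver cannot recover this after the fact: a page whose script writes some markup can serialize to exactly the bytes another page would ship as static content. A toy Python model (document.write modeled as appending to the DOM; names are illustrative):

```python
import re

def load(html):
    # Toy page load: document.write('X') appends X to the DOM and the
    # script element itself remains in the serialized output.
    dom = html
    for text in re.findall(r"document\.write\('([^']*)'\)", html):
        dom += text
    return dom

# Page A writes <p>x</p> at load time; page B ships it as static markup.
page_a = "<script>document.write('<p>x</p>')</script>"
page_b = "<script>document.write('<p>x</p>')</script><p>x</p>"

# A's live DOM is byte-for-byte identical to B's source, so a saver
# looking at the DOM alone cannot tell script output from original
# content; only the caller, who controlled how the document was
# loaded, can know.
assert load(page_a) == page_b
```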
-Boris
We need the unmolested DOM from somewhere before JS gets its hands on
it, which means holding a copy around or somehow making one from the
cached page data. The latter might be possible but the DOM parsing code
expects presshells, docshells and other unpleasantness which makes it
totally impractical at the moment. We really need a simple parser that
can be created on the fly and handed a stream, with a DOM popping out
of the other end. It would be nice to know when a page is 'dirty' too,
so creating a fresh DOM would only be done when necessary.
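As a sketch of that shape of API (hypothetical names, no relation to the actual Gecko parser), here is a minimal hand-it-a-stream, get-a-DOM-back parser built on Python's html.parser; it needs no docshell or presshell and, by construction, never runs script:

```python
from html.parser import HTMLParser

class Node:
    # Bare-bones DOM node: tag name, child list, accumulated text.
    def __init__(self, tag):
        self.tag, self.children, self.text = tag, [], ""

class TreeBuilder(HTMLParser):
    # Builds a tree of Nodes as the underlying parser sees tags.
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data

def parse(stream):
    # Feed chunks incrementally, the way network data would arrive.
    builder = TreeBuilder()
    for chunk in stream:
        builder.feed(chunk)
    builder.close()
    return builder.root

dom = parse(["<html><body><p>hel", "lo</p></body></html>"])
```

Even text split across chunk boundaries comes out whole, since the parser buffers partial input between feed() calls.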
There is a parser already for the XMLHttpRequest stuff, but one for
HTML content is required too. There is a bug open on this, but the
problem of making a simple DOM parser has not been solved yet.
http://bugzilla.mozilla.org/show_bug.cgi?id=115328
Adam