Bruno Dumon wrote: ...
I have done some work on this. I first wrote a js html editor for IE (>5.5) to be used in an XML content management system. For this we needed to clean the html and convert it to xhtml so that it can be processed with xslt when displaying pages.
* different users of the widget (like the doco project vs the project where we need it) will likely require different subsets of HTML to be used.
* support for both Mozilla and IE is important. Other browsers should fall back to a textarea with raw HTML in it.
* the HTML produced by the editor should be cleaned (i.e. unsupported tags & attributes removed) and normalized (formatted). The goal of this is to deliver a nice XHTML-subset-doc for storage, and to show nice HTML to people editing it manually. Hopefully this will also make it possible to do meaningful text-based diffs.
One approach that I've tried is to generate the xhtml from the browser dom with javascript, i.e. walk the tree and recursively generate <TAG> ... </TAG> entries, while surrounding all attribute values with quotes. This could then be postprocessed on the server by parsing it with an XML parser and manipulating the DOM tree. This however proved to be a slight nightmare due to js/dom bugs in IE 5.5. If you'd be willing to drop 5.5 support it would be easier, but it might also be possible to do this using more IE-specific js constructions, with which I'm not particularly familiar.
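To make the tree-walking idea concrete, here is a rough sketch of such a serializer. It is not the code from the editor discussed here, just an illustration. The function names (`serializeToXhtml`, `escapeText`, `escapeAttr`) and the list of empty elements are my own assumptions, and it is written in modern JavaScript, so it would not run in IE 5.5 as is. It only relies on the standard DOM properties `nodeType`, `nodeName`, `nodeValue`, `attributes` and `childNodes`, so it can be tried on plain objects as well as on real DOM nodes.

```javascript
// Node type constants as defined by the W3C DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Elements without content, written as <br /> etc. (assumed list, extend as needed).
const EMPTY_ELEMENTS = new Set(["br", "hr", "img", "input", "meta", "link"]);

// Escape the characters that are special in XML text content.
function escapeText(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Attribute values are always double-quoted, so quotes are escaped too.
function escapeAttr(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

// Recursively turn a DOM(-like) node into an XHTML string.
function serializeToXhtml(node) {
  if (node.nodeType === TEXT_NODE) {
    return escapeText(node.nodeValue);
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return ""; // comments and other node types are dropped
  }
  const name = node.nodeName.toLowerCase(); // XHTML wants lowercase names
  let out = "<" + name;
  const attrs = node.attributes || [];
  for (let i = 0; i < attrs.length; i++) {
    out += " " + attrs[i].name.toLowerCase() + '="' + escapeAttr(attrs[i].value) + '"';
  }
  if (EMPTY_ELEMENTS.has(name)) {
    return out + " />";
  }
  out += ">";
  const children = node.childNodes || [];
  for (let i = 0; i < children.length; i++) {
    out += serializeToXhtml(children[i]);
  }
  return out + "</" + name + ">";
}

// Usage with a hand-built tree standing in for a browser DOM node.
const sample = {
  nodeType: 1, nodeName: "P", attributes: [{ name: "CLASS", value: "intro" }],
  childNodes: [
    { nodeType: 3, nodeValue: "a < b" },
    { nodeType: 1, nodeName: "BR", attributes: [], childNodes: [] },
  ],
};
console.log(serializeToXhtml(sample)); // <p class="intro">a &lt; b<br /></p>
```

In a browser the same function could be called on the editable document's body element. The browser-specific quirks mentioned above are exactly what this sketch ignores.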
Eventually we ended up doing this completely server side: I wrote one component to fix the html to be xhtml, and after that I use an XML parser to remove all unwanted attributes and tags.
The biggest problem while handling the html is that you also have to parse Word html that is pasted into the editor, and the html that Word produces is truly gruesome!
While the server side solution works well for all the html garbage that I have encountered so far, it is not completely satisfactory: when you paste the html into the editor you're looking at the unprocessed html, and once it has been processed by the server a lot will have been removed and it can look rather different. One could try to explain this to the user, but it's better to filter the html directly after pasting it, so the user will not get confused.
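The step of removing unwanted tags and attributes could look roughly like the sketch below. This is not our server-side component, only an illustration of whitelist filtering. The `ALLOWED` table and the function name `cleanNode` are hypothetical, and dropping a disallowed element while keeping its children is a design choice I'm assuming here (it keeps the text inside Word's font and span wrappers). It works on plain objects shaped like DOM nodes, so in principle the same logic could run right after a paste as well as on the server.

```javascript
// Hypothetical whitelist: element name -> attribute names that may be kept.
// Each application would supply its own subset of HTML here.
const ALLOWED = {
  p: [], strong: [], em: [], ul: [], ol: [], li: [],
  a: ["href"],
  img: ["src", "alt"],
};

// Returns an array of cleaned nodes: a text node yields a copy of itself,
// an allowed element yields itself with filtered attributes, and a
// disallowed element is unwrapped, i.e. replaced by its cleaned children.
function cleanNode(node) {
  if (node.nodeType === 3) {
    return [{ nodeType: 3, nodeValue: node.nodeValue }];
  }
  if (node.nodeType !== 1) {
    return []; // comments etc. are removed
  }
  const children = [];
  const kids = node.childNodes || [];
  for (let i = 0; i < kids.length; i++) {
    children.push(...cleanNode(kids[i]));
  }
  const name = node.nodeName.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, name)) {
    return children; // unwanted tag: keep the content, drop the wrapper
  }
  const attributes = [];
  const attrs = node.attributes || [];
  for (let i = 0; i < attrs.length; i++) {
    const attrName = attrs[i].name.toLowerCase();
    if (ALLOWED[name].includes(attrName)) {
      attributes.push({ name: attrName, value: attrs[i].value });
    }
  }
  return [{ nodeType: 1, nodeName: name, attributes: attributes, childNodes: children }];
}
```

Because the whitelist is plain data, different users of the widget could plug in different subsets of HTML without touching the filtering code.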
I'm now in the process of writing an editor component that can handle IE and Mozilla. It is in a working state, but the code needs to be cleaned up and some stuff still needs to be written (a table editor, a url editor, etc.). It is however for a closed source system. I could discuss it to see if we would be willing to release it as open source.
This won't work: you need valid xml to use xsl, and the IE html in particular can be very troublesome to fix.
My first thought was to do this cleanup stuff serverside (could be as simple as an XSL, which would make it easily customisable too). However it seems like you want to do all that on the client side?
* Currently in e.g. Linotype the source for the editor (thus of the iframe) is fetched separately from the main page. This is harder to do with cforms since then the pipeline from which the content is fetched should also have access to the cforms Form which is stored somewhere in a variable in a flowscript. For the cforms widget it would be easier I think to embed the HTML directly in the page (e.g. as a Javascript variable). This also makes it possible to assign the content either to the html editor or the textarea depending on what the client supports.
* Automatic image upload: still need to think more about this. After pressing the submit button (and afterwards possibly showing the form again), the images will need to become available in the URL space. How that's done will probably differ from application to application so we could put that behaviour behind an interface.
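The idea of embedding the HTML in the page as a Javascript variable could be sketched as follows. This is only an illustration, not cforms code: the names `toScriptLiteral`, `loadContent` and `editorContent`, and the way rich-editing support is passed in as a flag, are all assumptions. The main point is that the content has to be escaped so that it cannot terminate the surrounding script element.

```javascript
// Server side: turn the stored HTML into a JavaScript string literal that is
// safe inside a <script> element. JSON.stringify handles quotes, backslashes
// and newlines; "<" is escaped so the content can never close the script
// element, and U+2028/U+2029 are escaped because older JavaScript engines
// treat them as line terminators inside string literals.
function toScriptLiteral(html) {
  return JSON.stringify(html)
    .replace(/</g, "\\u003c")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

// The page template would then contain something like:
//   var editorContent = <output of toScriptLiteral(html)>;

// Client side: hand the content to the rich editor's iframe if the browser
// supports it, otherwise to the plain textarea. How support is detected is
// left open here and simply passed in as a boolean.
function loadContent(content, editorFrame, textarea, supportsRichEditing) {
  if (supportsRichEditing) {
    editorFrame.contentWindow.document.body.innerHTML = content;
  } else {
    textarea.value = content;
  }
}
```

With this approach the page needs no second request for the editor content, which is what makes it attractive when the Form only lives in a flowscript variable.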
This is an interesting problem. Stefano talked about embedding it into the document; how would you want to do this? That would be the best solution for an embeddable component!
* wiki syntax support: we have no need for this, so don't expect any effort from me on that.
Regards, Marc.