Arrgh, more XML/HTML problems now it's '&'

Justin Fagnani-Bell Sat, 10 Aug 2002 14:41:52 -0700

Hi again,

   <warning> this is a long post </warning>


   I'm still working on HTML forms where the user (me for the moment:) is 
supposed to input HTML into a text area that will be stored in an XML 
format. I'm still having problems, so I haven't written a SUMMARY post...

My new problem occurred last night when I was testing the system and put 
in an anchor tag with a URL that has request parameters, like this:

<a href="http://www.something.net/apage.jsp?p1=hi&p2=bye">link</a>

Well, when I hit submit the form is supposed to come back filled out, 
but instead I get an error that states "the entity 'p2' must end with 
a ';'".

So I did some searching on w3.org, and sure enough, URLs in XHTML have 
to use '&amp;' instead of '&'. Arrgh, I know this will cause problems 
once people who are used to normal HTML start using this. I'm 
considering writing a filter that will escape illegal characters on the 
way in and un-escape them going back to the user, but that seems like a 
bit of a pain. Combined with the problems I'm having making people 
type XML-compliant HTML in the first place, I'm wondering if there's a 
completely different way I could do this.
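
As a rough sketch of the filter idea above (not an existing Cocoon 
component), the following Java helper escapes only those '&' characters 
that do not already begin an entity or character reference. The class and 
method names are made up for illustration, and the reverse step assumes the 
user never typed a literal '&amp;' on purpose.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cocoon: names are illustrative only.
final class AmpersandFilter {

    // Matches an '&' that is NOT followed by something shaped like a
    // named entity (&name;), a decimal reference (&#123;) or a hex
    // reference (&#x1F;). Note that this only checks the shape: a name
    // such as &nbsp; is left alone even though a parser without a DTD
    // would still reject it.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // On the way in: turn bare '&' into '&amp;' so the text can be parsed.
    static String escapeBareAmpersands(String input) {
        return BARE_AMP.matcher(input).replaceAll("&amp;");
    }

    // On the way back to the edit form: undo the escaping.
    // Assumption: every '&amp;' in the stored text came from the step
    // above; an '&amp;' the user typed deliberately would be lost.
    static String unescapeAmpersands(String stored) {
        return stored.replace("&amp;", "&");
    }
}
```

With this, the href from the example would be stored as 
apage.jsp?p1=hi&amp;p2=bye and shown to the user again with a plain '&'.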

I'm sure someone else out there has come across these problems before. 
They seem inevitable when building a webapp that uses XML on the backend 
and lets users edit some content. The users only marginally know 
HTML in the first place and can't be expected to follow the rules 
correctly every time. The app, after all, is supposed to be easy to use.

I would love to start some discussion on different ideas for handling 
these types of problems. They must be common among Cocoon users, and 
maybe we can come up with a set of solutions (HOW-TO's, Java helper 
classes, taglibs) to make life easier on Cocoon developers and end-users.

Here's my little list of requirements, issues, and assumptions when 
dealing with forms, user input, and xml.

1) My users are used to HTML, not XML
2) My users are not fail proof, and are probably prone to occasional 
mistakes
3) Ideally I want them to be able to input HTML (non-XML-compliant), 
plain text, or XML (not HTML, but any XML; this is actually preferred, 
but sometimes users are just entering a news item or a BBS post, and it 
seems reasonable to allow them to use HTML for formatting rather than 
inventing my own XML dialect)
4) The data is going to be in an XML document/SAX stream at some point
   (either stored that way, or stored in a database and turned into xml 
through a generator)
5) Sometimes I want to run XSL transformations on the data when it is 
output.
6) When editing the data, I'd like to have it appear exactly as the user 
typed it
7) But I'd also like to have the ability to clean it up (as an option)
8) The browsers like HTML 4 much better than XHTML, therefore the pages 
I send them work better if I use the HTMLSerializer

Here are some problems I've encountered so far.

1) users don't follow XML rules very well (goes along with point 1)
2) the HTMLSerializer changes the user's data by turning <br/> into <br>, 
etc
3) the XMLSerializer changes the user's data by turning 
<textarea></textarea> into <textarea/>, etc
4) bad user input will cause SAXExceptions if it's not enclosed in CDATA 
sections

(Oh, to clarify here: I typically have two pages which show the data. 
One is the 'edit' page with the form; the other is the 'viewing page', 
where the data actually shows up. The HTMLSerializer is no problem 
on the viewing page, just on the editing page.)

Some of these points interfere with some solutions. For example, I could 
wrap the data in a CDATA section to get around XML compliance, but then 
I wouldn't be able to run XSL transformations on it (correct me if I'm 
wrong anywhere). Maybe I could check if the data is xml compliant and 
wrap it only if it isn't.
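
A minimal sketch of that "wrap only if it isn't well-formed" check, using 
the standard JAXP SAX parser. The class name, method names, and the 
temporary <root> wrapper element are assumptions made for illustration; 
nothing here is Cocoon API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names are illustrative only.
final class FragmentWrapper {

    // A fragment may have several top-level nodes, so it is parsed
    // inside a throwaway <root> element.
    static boolean isWellFormedFragment(String fragment) {
        String doc = "<root>" + fragment + "</root>";
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(doc)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // parser rejected the input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Well-formed input is returned untouched (so XSL can still see its
    // elements); anything else is wrapped in a CDATA section.
    static String wrapIfNeeded(String fragment) {
        if (isWellFormedFragment(fragment)) {
            return fragment;
        }
        // "]]>" cannot appear inside CDATA, so split it over two sections.
        String safe = fragment.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }
}
```

The trade-off stays the same as described above: only the input that 
passes the check remains visible to a stylesheet as markup, while the 
wrapped input is treated as plain text.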

Here are some ideas for solutions:

1) Create a new HTMLSerializer that can selectively determine which tags 
it will convert into HTML and which it will leave alone. This way you 
could specify that all textarea tags and their contents shouldn't be 
touched (I would think this would be a reasonable default feature anyway)
2) Create a jTidy-like program that will turn HTML into XHTML, but work 
for fragments (jTidy seems to only output complete HTML documents)
3) Create a class that can find an XML error, and report it nicely back 
to the user so they can fix it. (I recall a demo with Cocoon 1.8.x that 
had something like this...)
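
For idea 3), one possible starting point is the line and column 
information carried by SAXParseException. The sketch below is only an 
assumption about how such a class might look (the names are invented, and 
it is not the Cocoon 1.8.x demo mentioned above). Positions reported by 
parsers are approximate, and an unclosed tag is typically reported at the 
end of the input rather than where it was opened.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names are illustrative only.
final class XmlErrorReporter {

    // Returns an empty Optional for well-formed input, otherwise a
    // message that could be shown next to the form field.
    static Optional<String> describeError(String fragment) {
        // The wrapper tags sit on their own lines so that the parser's
        // line numbers are offset from the user's text by exactly one.
        String doc = "<root>\n" + fragment + "\n</root>";
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(doc)), new DefaultHandler());
            return Optional.empty();
        } catch (SAXParseException e) {
            // getLineNumber() may be -1 if the parser has no position.
            int line = Math.max(1, e.getLineNumber() - 1);
            return Optional.of("Line " + line + ", column "
                + e.getColumnNumber() + ": " + e.getMessage());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The message text itself comes from whichever parser is installed, so the 
wording the user sees will vary between parser implementations.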

Hmm, these three things might do it. The new serializer would work for 
editing; the Tidy-like class would work for either storing the data as 
XML or just viewing it as XML. I think I have an idea of how to do the 
serializer, but it wouldn't rely on a transformer like the current one. 
I looked at the code for jTidy and there's a ton of classes, so I've yet 
to fully comprehend how it works. It might already be able to do what I 
want, and like I said, I saw something similar to 3) a year or so ago...

OK, those are my thoughts...

Justin



---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <[EMAIL PROTECTED]>
For additional commands, e-mail:   <[EMAIL PROTECTED]>
