Great questions. I will keep it short.
Pure UTF-8 with no entities is not a problem at all, as long as you declare 
UTF-8 as the encoding of your HTML page and keep that encoding all the way 
through.
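
To make "UTF-8 all the way through" concrete, here is a minimal sketch in 
Python (not MarkLogic code) of what a form POST body from a UTF-8 page looks 
like once it is decoded. The field name `title` and the sample value are 
assumptions for illustration only.

```python
from urllib.parse import parse_qs

# Body of a form POST as a browser would send it from a UTF-8 page.
# The curly quotes travel as percent-encoded UTF-8 bytes, not as entities.
raw_body = "title=%E2%80%9CRed+%26+White%E2%80%9D"

# Decode the body as UTF-8; the result is ordinary Unicode text.
fields = parse_qs(raw_body, encoding="utf-8")
title = fields["title"][0]

print(title)  # “Red & White”
```

No entity ever appears: the user typed characters, the browser sent bytes, and 
the server got the same characters back.
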
*Escaping* HTML characters is not a problem either. When you get data directly 
from a form POST into MarkLogic, it is handled for you: you get a UTF-8 string 
and all appropriate encoding/escaping is done. For example, a user can enter 
"This & That" and your code doesn't need to manage it, *as long as you put it 
directly into XML* (or XHTML), like:

    <p>{ $user-typed-text }</p>
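
The same principle can be tried outside MarkLogic. The sketch below uses 
Python's standard `xml.etree.ElementTree` as a stand-in (an assumption for 
illustration, not what MarkLogic runs internally): the user's string is stored 
as text content, and the ampersand is escaped only when the tree is serialized.

```python
import xml.etree.ElementTree as ET

# Hypothetical user input: a bare ampersand plus non-ASCII quotes.
user_typed_text = "This & That \u201cquoted\u201d"

# Put the string into the element as text, never by string concatenation.
p = ET.Element("p")
p.text = user_typed_text

# Escaping happens at serialization time, not in application code.
serialized = ET.tostring(p, encoding="unicode")
print(serialized)  # <p>This &amp; That “quoted”</p>

# Parsing the serialized form gives back exactly what the user typed.
round_tripped = ET.fromstring(serialized).text
print(round_tripped == user_typed_text)  # True
```
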


Beyond that, you are entering a rich world of choices (and complexity). The 
range is wide open, but so is the task.
Nothing particular about MarkLogic will solve this or hinder it.
Typically what I have seen done is to use either a middle-tier app server that 
handles "rich text" or a JavaScript library that does so.
These handle all the encoding and WYSIWYG issues. You don't really want people 
embedding entities as HTML markup; that's a bad road to take. Soon you will 
find you need <b>Bold</b>, then
<a href="http://www.crashyourcomputer.com">Click here for a discount code</a>, 
etc.

If you WANT users to enter full HTML or XHTML, you can, but it is very risky. 
Turning the text into XML is easy (xdmp:unquote( $user-text )), 
but it's almost never what you really want to expose to users.
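
To illustrate why raw markup is risky, here is a rough sketch in Python (not 
MarkLogic code) that audits user-supplied markup against a tiny allow-list. 
The allow-list, the `TagAuditor` class, and the `audit` helper are all 
hypothetical names made up for this example; a real policy needs far more care 
than this.

```python
from html.parser import HTMLParser

# Hypothetical allow-list for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class TagAuditor(HTMLParser):
    """Collects tags in user markup that fall outside the allow-list."""

    def __init__(self):
        super().__init__()
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        # Reject unknown tags, and any allowed tag that carries attributes.
        if tag not in ALLOWED_TAGS or attrs:
            self.rejected.append(tag)


def audit(user_markup):
    """Return the list of start tags that would have to be refused."""
    auditor = TagAuditor()
    auditor.feed(user_markup)
    auditor.close()
    return auditor.rejected


print(audit("<b>Bold</b>"))  # []
print(audit('<b>Bold</b> <a href="http://www.crashyourcomputer.com">Click</a>'))  # ['a']
```

Even this toy version shows how fast the rules grow once links, attributes, 
and scripts are in play, which is why a dedicated editor or library is the 
usual answer.
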

Instead, a smart rich-text editing field is what people use.
I don't know what the current best option is, and it depends on your language 
and tool choices, but I have used this:
http://www.gwtproject.org/doc/latest/DevGuideUiWidgets.html   (see Rich Text)


And a Google search turned up:
http://www.sitepoint.com/html5-wysiwyg/
http://ckeditor.com/
http://yuilibrary.com/


I personally would recommend either sticking to pure (Unicode) text or using a 
tool or library that does it well;
anything in between is going to be a never-ending struggle.
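
If some input still arrives with entity references in it, as in the examples 
from the original question, one way to get back to pure Unicode text is to 
decode the entities once, at the boundary, and escape again only on output. 
The sketch below uses Python's standard `html` module purely as an 
illustration; it is not MarkLogic code, and the variable names are made up.

```python
import html

# Input mixing a named entity, a bare ampersand, and a numeric reference.
raw = "&ldquo; Red & White&#8221;"

# Decode once on the way in. A bare "&" followed by a space is left alone.
text = html.unescape(raw)
print(text)  # “ Red & White”

# Store `text` as plain Unicode. Escape only when writing HTML back out.
for_display = html.escape(text, quote=False)
print(for_display)  # “ Red &amp; White”
```

The catch is the ambiguity the question already hints at: with lenient 
decoding, a user who really means a literal string that looks like an entity 
has no way to say so, which is one more argument for plain text in, plain text 
out.
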

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com




From: (address redacted) On Behalf Of Tim
Sent: Tuesday, October 07, 2014 4:24 AM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Handling HTML entry of encoded characters for 
entry into XML

Hi Folks,

I am creating an HTML entry form for inputting text that can extend beyond the 
ASCII range, so the trick is standardizing the input of entities, and of course 
what to do with the ampersand character.  There are 2 parts to this challenge:


1. Creating the text entry UI and providing rules for inputting entities, as 
   well as detecting and reporting invalid entries, and

2. Converting the inputted entities into their corresponding UTF-8 value for 
   storage in MarkLogic, especially so that the exported values can be 
   converted back into the appropriate entities for HTML display or for 
   export, such as to a Microsoft Word document.

It seems that I cannot have my cake and eat it too, for example if I want to 
allow the user to simply insert a title with an ampersand they could enter:
                Red & White

But if I want to allow them to enter other encoded values such as:

                &ldquo; Red & White&#8221;

Then there needs to be the expectation that entering an ampersand by itself is 
disallowed, and that the former must be supplied as

                Red &amp; White

So how do folks tend to deal with this issue for each of the parts that I 
describe above?

Thanks for any help with this. It seems like a simple issue, but it has a lot 
of complexity, especially when folks allow proprietary named and numbered HTML 
encodings with private use area Unicode mapping. Is this the bane of UI entry 
for XML UTF-8 mapping or what? ☺

Tim M.
