Re: [MarkLogic Dev General] How to handle named HTMLcharacter entities when loading an ISO-8859-1 encoded document into MarkLogic?

Tim Meagher Mon, 05 Jul 2010 10:34:37 -0700

Hi Mary,


Ah yes, using the full repair option does cause it to recognize the &sim;
entity.  But I'm back to the XDMP-DOCUNEOF  error, i.e.,

 

2010-07-05 13:21:16.682 Notice: TaskServer: XDMP-DOCUNEOF:
http://[server]/[doc-path]/[doc-name].xml 536 document 1

 

What does the number 536 mean in the error message?  

 

This only happens with some of the ISO-8859-1 encoded content that is being
processed.  I thought it might be a problem with the document, but if I pull
up the URL in oxygen using the location URL, the entire document is fetched
without a problem.  I also tried getting it from a couple of different
servers and still get the same error.  My next step will be to try
truncating the content and seeing if I can identify a problematic set of
characters or to see if size is an issue.

 

Thank you!

 

Tim

 

-----Original Message-----
From: Mary Holstege [mailto:[email protected]] 
Sent: Monday, July 05, 2010 12:48 PM
To: 'General Mark Logic Developer Discussion'; Tim Meagher
Subject: Re: [MarkLogic Dev General] How to handle named HTMLcharacter
entities when loading an ISO-8859-1 encoded document into MarkLogic?

 

On Mon, 05 Jul 2010 03:58:31 -0700, Tim Meagher <[email protected]> wrote:

 

> Hi Geert,

> 

> 

> Interesting.  I checked into the document and noticed that it references a

> DTD that references entities defined in files separate from the DTD.

> 

> 

> Thanks,

> 

> 

> Tim

 

If you load with the "repair" option, MLS will recognize a host of character
entities, including "sim".  If you want them returned to you on output,
make the appropriate selection for SGML entities on your application server.
(The full list is taken from the W3C XML Entity Definition recommendation.)

 

If you have your own special entities, they need to be in the internal subset.
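
As a rough illustration of the internal-subset approach, the Python sketch below scans a document for named entity references and builds a DOCTYPE with matching declarations. It is only a sketch under stated assumptions: the function name `internal_subset` is hypothetical, the entity table comes from Python's standard `html.entities` module (the HTML 4 set, not MarkLogic's own list), and the simple regular expression does not distinguish references inside comments or CDATA sections.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already knows; these never need declaring.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Simplified pattern for a named entity reference such as &sim;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def internal_subset(xml_text: str, root_name: str) -> str:
    """Build a DOCTYPE whose internal subset declares the HTML 4 named
    entities referenced in xml_text (hypothetical helper, not a MarkLogic API)."""
    names = sorted(set(ENTITY_REF.findall(xml_text)) - PREDEFINED)
    decls = []
    for name in names:
        # Names that are not in the HTML 4 table are skipped, not guessed.
        if name in name2codepoint:
            decls.append('  <!ENTITY %s "&#x%X;">' % (name, name2codepoint[name]))
    return "<!DOCTYPE %s [\n%s\n]>" % (root_name, "\n".join(decls))
```

The returned declaration would go directly after the XML declaration of the document, replacing any existing DOCTYPE that points at external entity files.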

 

//Mary

 

> 

> -----Original Message-----

> From: [email protected]

> [mailto:[email protected]] On Behalf Of Geert  

> Josten

> Sent: Monday, July 05, 2010 6:43 AM

> To: General Mark Logic Developer Discussion

> Subject: Re: [MarkLogic Dev General] How to handle named HTMLcharacter

> entities when loading an ISO-8859-1 encoded document into MarkLogic?

> 

> 

> Hi Tim,

> 

> 

> To my knowledge, MarkLogic Server only accepts the five default XML named

> entities (lt, gt, amp, apos, quot) by default, and any other named entities

> added to the local declaration subset. External declarations are ignored.

> Off the top of my head, the local declaration should look something like the

> following; add it directly after the XML declaration:

> 

> 

> <!DOCTYPE {name_of_root} PUBLIC "some_pub_id" "some_system_id" [

> 

> 

> <!ENTITY sim "&#x0223C;">

> 

> 

> ]>

> 

> 

> It might be easier though to put a proxy service in between (if possible)

> that normalizes encoding, as well as resolves these entities (which usually

> only requires parsing the XML with a DTD declaration).
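
A minimal sketch of the normalizing step such a proxy might perform is shown below in Python. It makes several assumptions that are not from the thread: the function name `normalize` is hypothetical, the source is assumed to be ISO-8859-1 unless stated otherwise, and named entities are resolved from Python's `html.entities` table by rewriting them as numeric character references rather than by parsing against the document's DTD. The regular expression is deliberately simple and would also rewrite references inside comments or CDATA sections.

```python
import re
from html.entities import name2codepoint

# Predefined XML entities are left alone; every parser resolves them.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Matches the encoding pseudo-attribute of an XML declaration, if present.
XML_DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def normalize(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Decode raw bytes, replace known named entities with numeric
    character references, and return the document as UTF-8 bytes."""
    text = raw.decode(source_encoding)

    def to_numeric(match):
        name = match.group(1)
        # Unknown names are left untouched so the problem stays visible.
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#x%X;" % name2codepoint[name]

    text = ENTITY_REF.sub(to_numeric, text)
    # Keep the XML declaration consistent with the new byte encoding.
    text = XML_DECL_ENCODING.sub(r'\1"UTF-8"', text, count=1)
    return text.encode("utf-8")
```

The output can then be loaded as ordinary UTF-8 XML, since numeric character references need no entity declarations.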

> 

> 

> Kind regards,

> 

> Geert

> 

> 

>> 

> 

> 

> 

> drs. G.P.H. (Geert) Josten

> 

> Consultant

> 

> 

> Daidalos BV

> 

> Hoekeindsehof 1-4

> 

> 2665 JZ Bleiswijk

> 

> 

> T +31 (0)10 850 1200

> 

> F +31 (0)10 850 1199

> 

> 

> mailto:[email protected]

> 

> http://www.daidalos.nl/

> 

> 

> KvK 27164984

> 

> 

> 

> The information sent in or with this e-mail message originates from

> Daidalos BV and is intended solely for the addressee. If you have received

> this message unintentionally, we ask you to delete it. No rights can be

> derived from this message.

> 

> 

>> From: [email protected]

> 

>> [mailto:[email protected]] On Behalf Of

> 

>> Tim Meagher

> 

>> Sent: Monday, 5 July 2010 12:21

> 

>> To: 'General Mark Logic Developer Discussion'

> 

>> Subject: [MarkLogic Dev General] How to handle named

> 

>> HTMLcharacter entities when loading an ISO-8859-1 encoded

> 

>> document into MarkLogic?

> 

>> 

> 

>> Hi Folks,

> 

>> 

> 

>> 

> 

>> 

> 

>> I am using xdmp:document-load to insert content into
>> MarkLogic.  Until recently I had only been loading UTF-8 XML
>> into the database, but recently started encountering some
>> ISO-8859-1 encoded content.  I was able to adjust the
>> xdmp:document-load options to accommodate ISO-8859-1 and for
>> the most part it has been working okay; however, the
>> ISO-8859-1 content occasionally includes HTML character
>> entities such as &sim; which appear to be causing the load
>> to fail.  The failure generates an XDMP-DOCUNEOF error
>> message when the error is not trapped with a try-catch
>> block, but an XDMP-DOCENTITYREF error message when the
>> error is trapped with a try-catch block.

> 

>> 

> 

>> 

> 

>> 

> 

>> Is there a simple way to add a list of character entity
>> mappings to get this to work?  For example, I've read that
>> &sim; maps to the Unicode character U+223C
>> <http://www.fileformat.info/info/unicode/char/223c/index.htm>
>> (http://code.google.com/p/doctype/wiki/SimCharacterEntity).

> 

>> 

> 

>> 

> 

>> 

> 

>> Thanks ahead of time for any help with this!

> 

>> 

> 

>> 

> 

>> 

> 

>> Tim Meagher

> 

>> 

> 

>> 

> 

>> 

> 

>> 

> 

> _______________________________________________

> 

> General mailing list

> 

> [email protected]

> 

> http://developer.marklogic.com/mailman/listinfo/general

> 

 

 


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
