On Mon, 05 Jul 2010 10:34:19 -0700, Tim Meagher <[email protected]> wrote:

> Hi Mary,
>
>
> Ah yes, using the full repair option does cause it to recognize the &sim;
> entity.  But I'm back to the XDMP-DOCUNEOF error, i.e.,
>
>
> 2010-07-05 13:21:16.682 Notice: TaskServer: XDMP-DOCUNEOF:
> http://[server]/[doc-path]/[doc-name].xml 536 document 1
>
>
> What does the number 536 mean in the error message?

The error in general means that the XML parser thinks there is still an
open element when it hits the end of the file.  I think the 536 is
a line number.  It could be that the parser is somehow getting bad
information from the encoding transcoder, or that there is something
actually wonky with the files, or that repair wrongly thinks there is
something wonky with them and is getting confused.  What you
might try is transcoding them to UTF-8 (you can load them as text as
ISO-8859-1 and then save them out again) to see if they are getting
truncated somehow.  If so, then transcoding is going wrong.  If repair is
going wrong, then there may be a suspicious "repair inserting element bar"
line in the log.  I would expect Oxygen to complain if it isn't
well-formed XML, but you could try some other XML parser too, just to
be sure.

//Mary
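
A rough sketch of the two checks suggested above, done outside MarkLogic
with Python's standard library.  This is only an illustration, not part of
the original advice: the function names are made up, and it assumes you
have the original ISO-8859-1 bytes and the UTF-8 bytes that were saved back
out.  The first function reports where the transcoded copy diverges from
the original (for example, where it was cut off); the second runs the bytes
through expat as "some other XML parser" and reports the line of the first
well-formedness error.

```python
import xml.parsers.expat


def first_divergence(original_latin1: bytes, saved_utf8: bytes):
    """Compare the original ISO-8859-1 bytes with a UTF-8 copy.

    Returns None if both decode to the same text, otherwise the character
    index at which they first differ (or at which the shorter one ends).
    """
    a = original_latin1.decode("iso-8859-1")
    b = saved_utf8.decode("utf-8", errors="replace")
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def well_formedness_error(raw: bytes):
    """Parse raw XML bytes with expat.

    Returns None when the document is well-formed, otherwise a
    (line, column, message) tuple for the first error found.
    """
    # No encoding override: expat uses the document's own XML declaration.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw, True)
    except xml.parsers.expat.ExpatError as err:
        return (err.lineno, err.offset,
                xml.parsers.expat.ErrorString(err.code))
    return None


if __name__ == "__main__":
    complete = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
                '<doc><p>caf\xe9</p></doc>').encode("iso-8859-1")
    truncated = complete[:-6]  # drop the closing </doc> to mimic a cut-off file
    print(well_formedness_error(complete))
    print(well_formedness_error(truncated))
```

If the second check reports an error on a file that Oxygen accepts, that
would point at the bytes actually being fetched rather than at repair.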

>
>
> This only happens with some of the ISO-8859-1 encoded content that is being
> processed.  I thought it might be a problem with the document, but if I pull
> up the URL in Oxygen using the location URL, the entire document is fetched
> without a problem.  I also tried getting it from a couple of different
> servers and still get the same error.  My next step will be to try
> truncating the content and seeing if I can identify a problematic set of
> characters or to see if size is an issue.
>
>
> Thank you!
>
>
> Tim
>
>
> -----Original Message-----
> From: Mary Holstege [mailto:[email protected]]
> Sent: Monday, July 05, 2010 12:48 PM
> To: 'General Mark Logic Developer Discussion'; Tim Meagher
> Subject: Re: [MarkLogic Dev General] How to handle named HTMLcharacter
> entities when loading an ISO-8859-1 encoded document into MarkLogic?
>
>
> On Mon, 05 Jul 2010 03:58:31 -0700, Tim Meagher <[email protected]> wrote:
>
>> Hi Geert,
>>
>> Interesting.  I checked into the document and noticed that it references a
>> DTD that references entities defined in files separate from the DTD.
>>
>> Thanks,
>>
>> Tim
>
> If you load with the "repair" option, MLS will recognize a host of character
> entities, including "sim".  If you want them returned to you on output,
> make the appropriate selection for SGML entities on your application server.
> (The full list is taken from the W3C XML Entity Definition recommendation.)
>
> If you have your own special entities, they need to be in the internal
> subset.
>
> //Mary
>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Geert Josten
>> Sent: Monday, July 05, 2010 6:43 AM
>> To: General Mark Logic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] How to handle named HTMLcharacter
>> entities when loading an ISO-8859-1 encoded document into MarkLogic?
>>
>> Hi Tim,
>>
>> To my knowledge, MarkLogic Server only accepts the five default XML named
>> entities (lt, gt, amp, apos, quot) by default, and any other named entities
>> added to the local declaration subset. External declarations are ignored.
>> Off the top of my head, the local declaration should look something like
>> the following; add it directly after the XML declaration:
>>
>> <!DOCTYPE {name_of_root} PUBLIC "some_pub_id" [
>> <!ENTITY sim CDATA "&#x0223C;">
>> ]>
>>
>> It might be easier though to put a proxy service in between (if possible)
>> that normalizes encoding, as well as resolves these entities (which usually
>> only requires parsing the XML with a DTD declaration).
>>
>> Kind regards,
>>
>> Geert
>>
>> drs. G.P.H. (Geert) Josten
>> Consultant
>>
>> Daidalos BV
>> Hoekeindsehof 1-4
>> 2665 JZ Bleiswijk
>>
>> T +31 (0)10 850 1200
>> F +31 (0)10 850 1199
>>
>> mailto:[email protected]
>> http://www.daidalos.nl/
>>
>> KvK 27164984
>>
>> The information sent in or with this e-mail message originates from
>> Daidalos BV and is intended solely for the addressee. If you have received
>> this message unintentionally, we ask that you delete it. No rights can be
>> derived from this message.
>>
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of
>>> Tim Meagher
>>> Sent: Monday, 5 July 2010 12:21
>>> To: 'General Mark Logic Developer Discussion'
>>> Subject: [MarkLogic Dev General] How to handle named
>>> HTMLcharacter entities when loading an ISO-8859-1 encoded
>>> document into MarkLogic?
>>>
>>> Hi Folks,
>>>
>>> I am using xdmp:document-load to insert content into
>>> MarkLogic.  Until recently I had only been loading UTF-8 XML
>>> into the database, but recently started encountering some
>>> ISO-8859-1 encoded content.  I was able to adjust the
>>> xdmp:document-load options to accommodate ISO-8859-1 and for
>>> the most part it has been working okay; however, the
>>> ISO-8859-1 content occasionally includes HTML character
>>> entities such as &sim; which appear to be causing the load
>>> to fail (which subsequently is generating an XDMP-DOCUNEOF
>>> error message when the error is not trapped with a try-catch
>>> block but generates an XDMP-DOCENTITYREF error message when
>>> the error is trapped with a try-catch block).
>>>
>>> Is there a simple way to add a list of character entity
>>> mappings to get this to work?  For example, I've read that
>>> &sim; maps to the Unicode character U+223C
>>> <http://www.fileformat.info/info/unicode/char/223c/index.htm>
>>>  (http://code.google.com/p/doctype/wiki/SimCharacterEntity).
>>>
>>> Thanks ahead of time for any help with this!
>>>
>>> Tim Meagher
>>>
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
>>
>
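
For reference, the internal-subset approach discussed in the quoted thread
can be tried out with any ordinary XML parser.  The sketch below uses
Python's ElementTree (not MarkLogic), and the document and root element
name are invented for illustration.  Note that in standard XML 1.0 syntax
an internal entity is declared as <!ENTITY name "value">, without a CDATA
keyword, and a PUBLIC identifier must be followed by a system identifier,
so the example uses a bare internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "sim" is declared in the internal
# subset, so the parser can expand &sim; without fetching any external DTD.
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE root [\n'
    '  <!ENTITY sim "&#x223C;">\n'
    ']>\n'
    '<root>a &sim; b</root>'
)


def expand_text(doc: str) -> str:
    """Parse the document and return the root element's text content."""
    return ET.fromstring(doc).text


if __name__ == "__main__":
    text = expand_text(DOC)
    # The entity reference has been replaced by U+223C (TILDE OPERATOR).
    print([hex(ord(ch)) for ch in text])
```

Whether MarkLogic treats such a declaration identically is not something
this sketch demonstrates; it only shows the standard XML behaviour.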

