Karl Wright commented on CONNECTORS-1325:

Hi [~kavdeev]: There is no "translation" happening on the MCF side.  Axis 1.4 
is used with httpcomponents/httpclient for transport.  Axis uses the registered 
XML provider which in this case is xerces.
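
To see which XML provider a given JVM actually resolves, a minimal sketch along these lines may help. It only asks JAXP for its DOM factory and prints the implementing class; the class name `XmlProviderCheck` is made up for illustration, and whether Axis goes through this same factory in your deployment is an assumption to verify.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper for illustration; not part of ManifoldCF or Axis.
class XmlProviderCheck {

    // Returns the class name of the DocumentBuilderFactory implementation
    // that JAXP resolves on the current classpath.
    static String registeredProvider() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // With xerces on the classpath this is expected to name a
        // xerces factory class; otherwise the JDK's built-in one.
        System.out.println(registeredProvider());
    }
}
```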

The XML printed by the debug message is what is provided by Axis as the SOAP 
response. If Axis is rewriting the SOAP, that's not something we can address.  
We do not parse that SOAP response -- Axis does.  We just report it for 
debugging purposes.

There are two kinds of errors here, then.  The first kind is Axis rewriting the 
SOAP response in such a way that it is not parseable.  This is expected because 
the decimal character value lies outside the 16-bit range; it cannot be 
represented as a single Java char.  (The Unicode value is U+1F600, which Java 
stores as the surrogate pair D83D DE00.)  So even though the XML 
is legal, the XML cannot be parsed by Java because it is limited to standard 

The second kind of problem is that including an entity reference in the XML 
itself (not a field) is not allowed.  This is the case you actually care about 
if I understand correctly.  Unfortunately, if the XML is illegal, the xml 
parser will fail to parse it.  That's the end of the story, I'm afraid.
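
For anyone who wants to reproduce the first kind of failure outside MCF, here is a rough sketch. It assumes only the JAXP parser that ships with the JVM (or xerces, if that is what is registered); the class and method names are invented for the example. It shows the UTF-16 code units Java uses for U+1F600, and checks whether a document parses when the character is referenced whole versus as two separate surrogate references.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for illustration; not part of ManifoldCF or Axis.
class CharRefDemo {

    // Returns the UTF-16 code units Java uses for a code point, as hex.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // True if the registered JAXP parser accepts the document,
    // false if it rejects it with a SAX error.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 needs two Java chars (a surrogate pair).
        System.out.println(utf16Units(0x1F600));
        // One reference to the whole code point.
        System.out.println(parses("<v>&#128512;</v>"));
        // Each surrogate half referenced on its own.
        System.out.println(parses("<v>&#xD83D;&#xDE00;</v>"));
    }
}
```

The second document is the shape that matches the "&#xD83D" in the reported exception: a surrogate half is not a legal XML character on its own, so the parser rejects the reference.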

> Invalid XML character causing job to abort
> ------------------------------------------
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>         Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, 
> CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
> The following error causes the Manifold job to abort, so the job is never 
> able to finish.
> It would be good to have the crawler log this error rather than throw an 
> exception that stops the entire job.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - 
> Exception tossed: XML parsing error: Character reference "&#xD83D" is an 
> invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: 
> Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; 
> Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}
