[
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345898#comment-15345898
]
Karl Wright commented on CONNECTORS-1325:
-----------------------------------------
It looks like what you have here is a UTF-16-encoded character in a
supplementary plane:
https://en.wikipedia.org/wiki/UTF-16
As such, it should be encoded in XML as a single character reference with up
to six hex digits.
However, even if that were represented correctly, that still might not matter,
since a single 16-bit Java char cannot represent such a character by itself.
It might be possible, though, to tell Xerces to ignore such characters if they
were properly encoded, at least.
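To make the encoding point concrete, here is a small sketch. It shows that
Java stores a supplementary-plane code point as two 16-bit char values (a
surrogate pair), while XML expects one numeric reference for the whole code
point; a reference to either surrogate alone falls outside the XML 1.0 Char
range. The class and method names are made up for illustration, and U+1F600 is
an arbitrary example, not necessarily the character in the failing document.

```java
// Illustrative sketch only; SupplementaryCharDemo is a hypothetical class,
// not part of ManifoldCF or Xerces.
final class SupplementaryCharDemo {

    // Formats a code point as an XML numeric character reference.
    static String toCharacterReference(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    // Returns true if the code point matches the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // arbitrary supplementary-plane code point
        char[] units = Character.toChars(codePoint); // UTF-16 surrogate pair
        System.out.printf("UTF-16 units: %04X %04X%n", (int) units[0], (int) units[1]);
        System.out.println("Single reference: " + toCharacterReference(codePoint));
        System.out.println("Whole code point valid in XML: " + isValidXmlChar(codePoint));
        System.out.println("High surrogate alone valid in XML: " + isValidXmlChar(units[0]));
    }
}
```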
If Xerces cannot be told to do that, skipping the document because there are
bad characters in it would seem to be the only reasonable option. However, it
will be brute force, because *any* parsing error would have to be presumed to
be a character issue.
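A rough sketch of that skip-on-error approach follows. It is not the
ManifoldCF code: LenientXmlParser and parseOrSkip are hypothetical names, and
it uses the JDK's built-in JAXP parser rather than XMLDoc. Every SAX parse
failure is treated as a reason to skip the document, which is exactly the
brute-force trade-off described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class LenientXmlParser {

    // Parses the XML text, or returns Optional.empty() if parsing fails.
    static Optional<Document> parseOrSkip(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return Optional.of(builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
        } catch (SAXException e) {
            // Brute force: any parse failure is logged and the document skipped,
            // whether or not a bad character caused it.
            System.err.println("Skipping document, XML parsing error: " + e.getMessage());
            return Optional.empty();
        } catch (ParserConfigurationException | IOException e) {
            // Not a content problem, so do not hide it.
            throw new IllegalStateException("Unexpected XML parser failure", e);
        }
    }

    public static void main(String[] args) {
        // A well-formed value, then one with a lone surrogate character reference.
        System.out.println(parseOrSkip("<v>ok</v>").isPresent());
        System.out.println(parseOrSkip("<v>&#xD83D;</v>").isPresent());
    }
}
```

A caller would log the skipped document and carry on with the next one,
instead of letting the exception reach the worker thread and abort the job.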
> Invalid XML character causing job to abort
> ------------------------------------------
>
> Key: CONNECTORS-1325
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
> Project: ManifoldCF
> Issue Type: Bug
> Components: SharePoint connector
> Affects Versions: ManifoldCF 2.3
> Reporter: Phil
> Assignee: Karl Wright
> Priority: Blocker
>
> The following error is causing the Manifold job to abort, so the job is
> unable to finish.
> It would be good to have the crawler log this error, but not throw an
> exception that causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread -
> Exception tossed: XML parsing error: Character reference "�" is an
> invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error:
> Character reference "�" is an invalid XML character.
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
> at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
> at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
> at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64;
> Character reference "�" is an invalid XML character.
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
> ... 4 more
> {code}