[
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571456#comment-15571456
]
Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------
An important update!
I tested the "bad" character again and inspected the network traffic (HTTP
wire logging set to DEBUG) to see exactly what comes from SharePoint.
It turns out that the emoji character gets translated into a "wrong" format
on the MCF side: &#128512; ---> &#xD83D;&#xDE00;
{code}
DEBUG 2016-10-13 11:39:45,460 (Thread-2572) - http-outgoing-100 << "#'
ows__ModerationStatus='0' ows__Level='1' ows_Title='Task emoji
>>>😀<<<'
ows_UniqueId='5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}'
ows_owshiddenversion='3' ows_FSObjType='5;#0' ows_PermMask='0x7fffffffffffffff'
ows_FileRef='5;#sites/test-team/Lists/Main Task List/5_.000' />[\r][\n]"
...
DEBUG 2016-10-13 11:39:45,461 (Worker thread '45') - SharePoint: getListItems
FileRef value 'sites/test-team/Lists/Main Task List/5_.000', xml response:
'<ns1:listitems xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema"
xmlns:ns1="http://schemas.microsoft.com/sharepoint/soap/">
<rs:data ItemCount="1">
<z:row ows_Modified="2016-10-13 10:24:51" ows_Created="2016-10-12 17:30:55"
ows_ID="5" ows_GUID="{E583E8D8-52A7-4CD8-8A5F-6354D57D1E40}" ows_MetaInfo="5;#"
ows__ModerationStatus="0" ows__Level="1" ows_Title="Task emoji
>>>&#xD83D;&#xDE00;<<<"
ows_UniqueId="5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}"
ows_owshiddenversion="3" ows_FSObjType="5;#0" ows_PermMask="0x7fffffffffffffff"
ows_FileRef="5;#sites/test-team/Lists/Main Task List/5_.000"/>
</rs:data>
</ns1:listitems>'
DEBUG 2016-10-13 11:39:45,494 (Worker thread '45') - SharePoint: Can't get
version of '/Main Task List///5_.000' because of bad XML characters(?)
{code}
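And &#128512; (U+1F600) is a valid XML 1.0 character reference, whereas the
surrogate references &#xD83D; and &#xDE00; are not valid XML characters on
their own!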
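For reference, here is a minimal sketch of how this can be checked outside MCF, assuming the JDK's default JAXP parser. The class and method names (EmojiCharRefCheck, parseTitle) are made up for illustration and are not part of ManifoldCF:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only (not ManifoldCF code).
class EmojiCharRefCheck {

    /**
     * Parses a tiny document whose "title" attribute contains the given
     * character references. Returns the decoded attribute value, or null
     * if the parser rejects the document as not well-formed.
     */
    static String parseTitle(String charRefs) {
        String xml = "<row title=\"" + charRefs + "\"/>";
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of printing them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("title");
        } catch (SAXException e) {
            return null; // the parser refused the input
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int emoji = 0x1F600; // decimal 128512
        // Java strings are UTF-16, so this code point becomes two code units.
        char[] units = Character.toChars(emoji);
        System.out.printf("U+%X -> UTF-16 units %04X %04X%n",
            emoji, (int) units[0], (int) units[1]);

        // One reference to the full code point.
        System.out.println("&#128512; parsed: "
            + (parseTitle("&#128512;") != null));
        // One reference per UTF-16 code unit, as seen in the MCF log.
        System.out.println("&#xD83D;&#xDE00; parsed: "
            + (parseTitle("&#xD83D;&#xDE00;") != null));
    }
}
```

On a stock JDK I would expect the first reference to parse and the second to be rejected, which matches the behaviour in the log above.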
Could you please take a look at the parser?
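In case it helps, here is one possible workaround, sketched under assumptions: recombine adjacent hex surrogate references into a single code point reference before the XML reaches the parser. This is only an illustration of the idea, not necessarily what the attached patches do. The class SurrogateRefRepair is hypothetical, and the sketch ignores references written in decimal or with leading zeros:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step, for illustration only.
class SurrogateRefRepair {

    // A hex high-surrogate reference (D800-DBFF) immediately followed by
    // a hex low-surrogate reference (DC00-DFFF).
    private static final Pattern PAIR = Pattern.compile(
        "&#x([dD][89abAB][0-9a-fA-F]{2});&#x([dD][c-fC-F][0-9a-fA-F]{2});");

    /** Replaces each surrogate reference pair with one decimal reference. */
    static String repair(String xml) {
        Matcher m = PAIR.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char high = (char) Integer.parseInt(m.group(1), 16);
            char low = (char) Integer.parseInt(m.group(2), 16);
            // Combine the two UTF-16 code units into the real code point.
            int codePoint = Character.toCodePoint(high, low);
            m.appendReplacement(out, "&#" + codePoint + ";");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("ows_Title=\"Task emoji &#xD83D;&#xDE00;\""));
    }
}
```

Unpaired surrogate references are left untouched, so the parser would still report those as errors.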
Thank you!
> Invalid XML character causing job to abort
> ------------------------------------------
>
> Key: CONNECTORS-1325
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
> Project: ManifoldCF
> Issue Type: Bug
> Components: SharePoint connector
> Affects Versions: ManifoldCF 2.3
> Reporter: Phil
> Assignee: Karl Wright
> Priority: Blocker
> Fix For: ManifoldCF 2.5
>
> Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch,
> CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error causes the ManifoldCF job to abort, so the job can
> never finish.
> It would be better for the crawler to log this error without throwing an
> exception that stops the entire job.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread -
> Exception tossed: XML parsing error: Character reference "�" is an
> invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error:
> Character reference "�" is an invalid XML character.
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
> at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
> at
> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
> at
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64;
> Character reference "�" is an invalid XML character.
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
> ... 4 more
> {code}