[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571456#comment-15571456
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

An important update!

I tested the "bad" char again by inspecting the network traffic (http wire = 
DEBUG) to see exactly what comes from SharePoint.

It turned out that the emoji char gets translated into a "wrong" format on 
the MCF side: & # 128512; ---> & # xD83D;& # xDE00;

{code}
DEBUG 2016-10-13 11:39:45,460 (Thread-2572) - http-outgoing-100 << "#' 
ows__ModerationStatus='0' ows__Level='1' ows_Title='Task emoji 
&gt;&gt;&gt;&#128512;&lt;&lt;&lt;' 
ows_UniqueId='5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}' 
ows_owshiddenversion='3' ows_FSObjType='5;#0' ows_PermMask='0x7fffffffffffffff' 
ows_FileRef='5;#sites/test-team/Lists/Main Task List/5_.000' />[\r][\n]"
...
DEBUG 2016-10-13 11:39:45,461 (Worker thread '45') - SharePoint: getListItems 
FileRef value 'sites/test-team/Lists/Main Task List/5_.000', xml response: 
'<ns1:listitems xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" 
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" 
xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" 
xmlns:ns1="http://schemas.microsoft.com/sharepoint/soap/">
<rs:data ItemCount="1">
   <z:row ows_Modified="2016-10-13 10:24:51" ows_Created="2016-10-12 17:30:55" 
ows_ID="5" ows_GUID="{E583E8D8-52A7-4CD8-8A5F-6354D57D1E40}" ows_MetaInfo="5;#" 
ows__ModerationStatus="0" ows__Level="1" ows_Title="Task emoji 
&gt;&gt;&gt;&#xD83D;&#xDE00;&lt;&lt;&lt;" 
ows_UniqueId="5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}" 
ows_owshiddenversion="3" ows_FSObjType="5;#0" ows_PermMask="0x7fffffffffffffff" 
ows_FileRef="5;#sites/test-team/Lists/Main Task List/5_.000"/>
</rs:data>
</ns1:listitems>'
DEBUG 2016-10-13 11:39:45,494 (Worker thread '45') - SharePoint: Can't get 
version of '/Main Task List///5_.000' because of bad XML characters(?)
{code}
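
To illustrate the translation visible in the log above, here is a minimal sketch of the arithmetic. It is not MCF code, and the class and method names are made up for this example. It only shows that the code point 128512 (U+1F600) is stored in Java as two UTF-16 code units, and that writing each unit out as its own character reference gives exactly the pair seen in the second log entry.

```java
class SurrogatePairDemo {
    /**
     * Writes every UTF-16 code unit of the code point as its own hex
     * character reference. For code points above U+FFFF this yields two
     * surrogate references, which is the "wrong" form from the log.
     */
    static String toHexUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("&#x%04X;", (int) unit));
        }
        return sb.toString();
    }

    /** Recombines a high/low surrogate pair into one decimal character reference. */
    static String toCharRef(char high, char low) {
        return "&#" + Character.toCodePoint(high, low) + ";";
    }

    public static void main(String[] args) {
        System.out.println(toHexUnits(128512));            // &#xD83D;&#xDE00;
        System.out.println(toCharRef('\uD83D', '\uDE00')); // &#128512;
    }
}
```

Anything that serializes a Java String char by char, rather than code point by code point, could produce this output. Whether that is what happens on the MCF side is an assumption, not something the log proves.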

And the character reference & # 128512; is valid in XML 1.0!
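
A small sketch to back this up, assuming the JAXP parser bundled with the JDK (which may behave differently from the Xerces version MCF ships with). The class name and the `row` element are invented for the example. `isXml10Char` encodes the Char production from the XML 1.0 specification, and `parses` feeds both forms of the reference to a DocumentBuilder.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CharRefParseDemo {
    /** XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns true if the parser accepts the XML text, false on a SAX parse error. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. "Character reference ... is an invalid XML character"
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(128512)); // true: inside #x10000-#x10FFFF
        System.out.println(isXml10Char(0xD83D)); // false: surrogates are excluded
        System.out.println(parses("<row ows_Title=\"&#128512;\"/>"));
        System.out.println(parses("<row ows_Title=\"&#xD83D;&#xDE00;\"/>"));
    }
}
```

The reference as SharePoint sends it should be accepted, while the surrogate pair form should be rejected with the same kind of error as in the stack trace. The parser may also print a "[Fatal Error]" line to stderr for the rejected input.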

Could you please take a look at the parser?
Thank you!

> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, 
> CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error causes the ManifoldCF job to abort, so the job is never 
> able to finish.
> It would be good to have the crawler log this error instead of throwing an 
> exception that stops the entire job.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - 
> Exception tossed: XML parsing error: Character reference "&#xD83D" is an 
> invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: 
> Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; 
> Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
