[jira] [Comment Edited] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571237#comment-15571237
 ] 

Konstantin Avdeev edited comment on CONNECTORS-1325 at 10/13/16 8:43 AM:
-

The Stack Overflow thread you mentioned in the second message here describes 
the problem quite well:
this character encoding was introduced in XML 1.1: 
https://www.w3.org/TR/xml11/#sec-xml11
and a possible solution is setting the correct header: {code}{code}
I'm afraid it would take ages to get this fixed by MS.

P.S. The correct XML prologue won't help with emojis, but at least it would 
solve the issue with our "record separator" :)

To be honest, I'm not sure what we could do here; I'm not a fan of workarounds. 
We could leave it as it is now, but could you perhaps change the "bad 
character" warnings to WARN level? Currently they are shown at DEBUG only, 
which could be misleading in a production environment.
Thanks!
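
A minimal sketch of the XML 1.0 vs. XML 1.1 difference for the "record separator" (&#30;), assuming the header in question is an XML 1.1 declaration and that the JDK's bundled JAXP parser is in use. The class and method names are hypothetical and are not part of ManifoldCF:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class XmlPrologCheck {
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed for the declared XML version.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problem, unrelated to the content.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same body, once under each XML version declaration.
        String body = "<field>a&#30;b</field>";
        System.out.println("XML 1.0: " + parses("<?xml version=\"1.0\"?>" + body));
        System.out.println("XML 1.1: " + parses("<?xml version=\"1.1\"?>" + body));
    }
}
```

The only thing that changes between the two calls is the version in the declaration, which is why the producer (SharePoint in this case) would have to emit it.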



> Invalid XML character causing job to abort
> --
>
> Key: CONNECTORS-1325
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
> Project: ManifoldCF
> Issue Type: Bug
> Components: SharePoint connector
> Affects Versions: ManifoldCF 2.3
> Reporter: Phil
> Assignee: Karl Wright
> Priority: Blocker
> Fix For: ManifoldCF 2.5
>
> Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, 
> CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error causes the Manifold job to abort, so the job is never 
> able to finish.
> It would be good to have the crawler log this error, but not throw an 
> exception which causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - 
> Exception tossed: XML parsing error: Character reference "" is an 
> invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: 
> Character reference "" is an invalid XML character.
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
> at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
> at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
> at 
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
> at 
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; 
> Character reference "" is an invalid XML character.
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
> at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
> ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571538#comment-15571538
 ] 

Karl Wright commented on CONNECTORS-1325:
-

[~kavdeev]: One of the numbers is decimal, the other is hex.  They are, 
however, the same non-unicode value.  There is nothing wrong with the parser.




[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571543#comment-15571543
 ] 

Karl Wright commented on CONNECTORS-1325:
-

[~kavdeev]: Unless your company cares deeply about non-unicode emojis, can you 
instead attach the XML document that contains the  ?  The reason is that I 
want to see what context it is in.  The solution might be entirely different.

Thanks.



[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571456#comment-15571456
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

An important update!

I tested the "bad" char again by looking at the network traffic (http wire = 
DEBUG) to see exactly what comes from SharePoint.

It turned out that this emoji char gets translated into a "wrong" format 
on the MCF side: & # 128512; ---> & # xD83D;& # xDE00;

{code}
DEBUG 2016-10-13 11:39:45,460 (Thread-2572) - http-outgoing-100 << "#' 
ows__ModerationStatus='0' ows__Level='1' ows_Title='Task emoji 
' 
ows_UniqueId='5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}' 
ows_owshiddenversion='3' ows_FSObjType='5;#0' ows_PermMask='0x7fff' 
ows_FileRef='5;#sites/test-team/Lists/Main Task List/5_.000' />[\r][\n]"
...
DEBUG 2016-10-13 11:39:45,461 (Worker thread '45') - SharePoint: getListItems 
FileRef value 'sites/test-team/Lists/Main Task List/5_.000', xml response: 
'http://schemas.microsoft.com/sharepoint/soap/;>

   

'
DEBUG 2016-10-13 11:39:45,494 (Worker thread '45') - SharePoint: Can't get 
version of '/Main Task List///5_.000' because of bad XML characters(?)
{code}

And the code & #128512; is a valid XML 1.0 character reference!

Could you please take a look at the parser?
Thank you!
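
A small sketch of the arithmetic behind the two forms quoted above, using only the standard `Character` API. It does not show where in the stack the rewrite happens; it only shows that the two hex values are the UTF-16 code units Java uses for code point 128512. The class and method names are hypothetical:

```java
// Hypothetical demo class; not part of ManifoldCF or Axis.
class SurrogatePairDemo {
    /** One hex character reference per UTF-16 code unit. */
    static String perCodeUnitRefs(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#x")
              .append(Integer.toHexString(unit).toUpperCase())
              .append(';');
        }
        return sb.toString();
    }

    /** One decimal character reference per Unicode code point. */
    static String perCodePointRefs(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 128512; // decimal form seen on the wire
        String asJavaString = new String(Character.toChars(codePoint));

        System.out.println(Integer.toHexString(codePoint));   // code point in hex
        System.out.println(asJavaString.length());            // number of UTF-16 code units
        System.out.println(perCodeUnitRefs(codePoint));       // per-code-unit form
        System.out.println(perCodePointRefs(asJavaString));   // per-code-point form
    }
}
```

Any serializer that escapes a Java string one `char` at a time will produce the per-code-unit form; escaping one code point at a time keeps the single decimal reference.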



[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571573#comment-15571573
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

The hex format is not valid XML according to 
http://www.w3schools.com/xml/xml_validator.asp
The decimal format has no issues.
Thanks!



[jira] [Comment Edited] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571579#comment-15571579
 ] 

Konstantin Avdeev edited comment on CONNECTORS-1325 at 10/13/16 10:55 AM:
--

The company does not use emojis :)
It was just an example of how to reproduce the issue.
The original problem was with the "record separator" char, which is used in 
some libraries.

BTW: & #30 (decimal) is not a valid XML char either: 
http://www.w3schools.com/xml/xml_validator.asp
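
For reference, the values discussed in this thread can be checked against the `Char` production of the XML 1.0 specification (section 2.2) without an online validator. The sketch below is a direct transcription of that production; the class and method names are hypothetical:

```java
// Hypothetical helper; transcribes the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
class Xml10Chars {
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // Record separator, the emoji code point, and its two surrogate halves.
        int[] samples = {30, 128512, 0xD83D, 0xDE00};
        for (int cp : samples) {
            System.out.printf("U+%04X allowed in XML 1.0: %b%n", cp, isXml10Char(cp));
        }
    }
}
```

A character reference is well-formed in XML 1.0 only if the referenced value passes this check, whether it is written in decimal or in hex.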






[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571653#comment-15571653
 ] 

Karl Wright commented on CONNECTORS-1325:
-

Hi [~kavdeev]: There is no "translation" happening on the MCF side.  Axis 1.4 
is used with httpcomponents/httpclient for transport.  Axis uses the registered 
XML provider which in this case is xerces.

The XML printed by the debug message is what is provided by Axis as the SOAP 
response. If Axis is rewriting the SOAP, that's not something we can address.  
We do not parse that SOAP response -- Axis does.  We just report it for 
debugging purposes.

There are two kinds of errors here, then.  The first kind is Axis rewriting the 
SOAP response in such a way that it is not parseable.  This is expected because 
the decimal character value is not standard Unicode; it cannot be represented 
as a Java character.  (The 'unicode' value is 1F600).  So even though the XML 
is legal, the XML cannot be parsed by Java because it is limited to standard 
unicode.

The second kind of problem is that including an entity reference in the XML 
itself (not a field) is not allowed.  This is the case you actually care about 
if I understand correctly.  Unfortunately, if the XML is illegal, the xml 
parser will fail to parse it.  That's the end of the story, I'm afraid.
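
As a rough illustration of the behaviour requested earlier in this thread (log the bad document at WARN and carry on instead of aborting the job), a caller could catch the parse failure along these lines. This is a sketch under stated assumptions, not the code in the attached patches: the class name, method name, and the use of `java.util.logging` are all hypothetical.

```java
import java.io.StringReader;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not the ManifoldCF XMLDoc class.
class LenientFieldParser {
    private static final Logger LOG = Logger.getLogger("LenientFieldParser");

    /** Returns the parsed document, or null if the XML is not well-formed. */
    static Document parseOrSkip(String documentId, String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // The parser still rejects the document; only the reaction changes.
            LOG.warning(String.format(
                "Skipping '%s': bad XML at line %d, column %d: %s",
                documentId, e.getLineNumber(), e.getColumnNumber(), e.getMessage()));
            return null;
        } catch (Exception e) {
            // Configuration or I/O problems are not content errors; surface them.
            throw new IllegalStateException(e);
        }
    }
}
```

The caller would treat a `null` result as "skip this item", so one unparseable response does not stop the remaining documents from being processed.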

