Julien Massiera created CONNECTORS-1655:
-------------------------------------------

             Summary: Web connector - UnsupportedEncodingException utf-8
                 Key: CONNECTORS-1655
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1655
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
    Affects Versions: ManifoldCF 2.17
            Reporter: Julien Massiera


When crawling some sites (for instance this one: [http://www.antibes-juanlespins.com/] ), the job manages to index some documents but then stops with the following error:

Error: IO error: utf-8; filename=rseventspro_rss20_56.xml

Here is the MCF stack trace:

Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
    at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
    ... 3 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
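
For illustration, the "Caused by" part of the trace can probably be reproduced outside ManifoldCF. The sketch below assumes, based only on the exception message, that the charset name reaches InputStreamReader with a trailing "; filename=..." parameter still attached. The class CharsetParamDemo and the helper stripParameters are hypothetical and are not part of ManifoldCF; this is not a proposed patch, only a demonstration that the JDK rejects the unstripped name and accepts the stripped one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class CharsetParamDemo {
    // Hypothetical helper (not ManifoldCF code): keep only the text
    // before the first ';' and trim surrounding whitespace.
    static String stripParameters(String rawCharset) {
        int semi = rawCharset.indexOf(';');
        String name = (semi >= 0) ? rawCharset.substring(0, semi) : rawCharset;
        return name.trim();
    }

    // Returns true if InputStreamReader accepts the given encoding name.
    static boolean canDecode(String charsetName) {
        byte[] data = "<rss/>".getBytes(StandardCharsets.UTF_8);
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            return true;
        } catch (UnsupportedEncodingException e) {
            // Same exception type as in the "Caused by" section of the trace.
            return false;
        } catch (IOException e) {
            // Only reachable from close(); treated as a failure here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Value taken from the error message in this issue.
        String raw = "utf-8; filename=rseventspro_rss20_56.xml";
        System.out.println(canDecode(raw));                  // prints "false"
        System.out.println(canDecode(stripParameters(raw))); // prints "true"
    }
}
```

If the assumption holds, truncating the charset value at the first ';' before it is handed to the reader would be one way to avoid the exception. Where that value is parsed inside the connector is not shown in the trace above.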