konrad created CONNECTORS-675:
---------------------------------
Summary: MCF-ES fails to escape json correctly
Key: CONNECTORS-675
URL: https://issues.apache.org/jira/browse/CONNECTORS-675
Project: ManifoldCF
Issue Type: Bug
Components: Elastic Search connector
Affects Versions: ManifoldCF 1.2
Environment: MCF 1.2-SNAPSHOT running on Win2008R2.
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
----------------
elasticsearch 0.90.0rc2 on ubuntu 12.10
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
-----------------
Repository Connection: FileSystem
Output Connection: ElasticSearch
Reporter: konrad
When crawling a filesystem into elasticsearch, the generated JSON contains
invalid UTF-8 sequences. This causes elasticsearch to fail the index operation.
Stack trace from elasticsearch:
[2013-04-19 13:17:38,952][DEBUG][action.index ] [Lighting Rod]
[eses2][0], node[Ycj8DEZMQFuX7Gn2sSCUXw],
[P], s[STARTED]: Failed to execute [index
{[eses][attachment][file:/C:/indexdir/Lüneburg/somefile],
source[{"uri" : "C:\\indexdir\\L�neburg\\somefile","allow_token_document" :
"__nosecurity__","deny_token_document" : "__nosecurity__","allow_token_share" :
"__nosecurity__","deny_token_share" :
"__nosecurity__","type" : "attachment","_name" : "collection.pickle","file" :
"KGRwMQp.....
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [uri]
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:395)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:599)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:467)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:506)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:326)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
at [Source: [B@56c77e95; line: 1, column: 254]
at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1378)
at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3008)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3002)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2165)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2092)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:275)
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:85)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:107)
at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:286)
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:384)
... 11 more
In this case the offending character is a German umlaut, 'ü', but since
ElasticSearchIndex#jsonStringEscape() does little more than escape
backslashes, I assume a wider range of special characters is affected.
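
For illustration only, here is a minimal sketch of a stricter escaping routine.
It is not the connector's actual code: the class name JsonEscapeSketch and the
method escape() are hypothetical. It escapes quotes, backslashes and control
characters, and writes every non-ASCII character as a \uXXXX escape, so the
resulting JSON is pure ASCII and no longer depends on the charset used to send
it. The byte 0xfc in the error above is 'ü' in ISO-8859-1; that the request
body was encoded with a non-UTF-8 charset is an assumption, not something the
log confirms.

```java
public final class JsonEscapeSketch {
    private JsonEscapeSketch() {}

    /**
     * Returns the value as a quoted JSON string literal that contains
     * only printable ASCII characters (hypothetical helper, not MCF code).
     */
    public static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20 || c > 0x7e) {
                        // Control and non-ASCII characters become a
                        // backslash-u escape with four hex digits.
                        // Surrogate pairs are emitted as two such escapes,
                        // which JSON permits.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // The path from the report, with the umlaut written as a Java
        // unicode escape so the source file itself stays ASCII.
        System.out.println(escape("C:\\indexdir\\L\u00fcneburg\\somefile"));
    }
}
```

Running main prints "C:\\indexdir\\L\u00fcneburg\\somefile" (including the
quotes), which a JSON parser decodes back to the original path. Alternatively,
the characters could be left unescaped, provided the request body is
explicitly encoded as UTF-8.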