konrad created CONNECTORS-675:
---------------------------------

             Summary: MCF-ES fails to escape json correctly
                 Key: CONNECTORS-675
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-675
             Project: ManifoldCF
          Issue Type: Bug
          Components: Elastic Search connector
    Affects Versions: ManifoldCF 1.2
         Environment: MCF 1.2-SNAPSHOT running on Win2008R2.
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
----------------
elasticsearch 0.90.0rc2 on ubuntu 12.10
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
-----------------
Repository Connection: FileSystem
Output Connection: ElasticSearch

            Reporter: konrad


When crawling a file system into Elasticsearch, the generated JSON contains
invalid UTF-8 sequences, which causes Elasticsearch to fail the index operation.

Stacktrace from elasticsearch:

[2013-04-19 13:17:38,952][DEBUG][action.index             ] [Lighting Rod] 
[eses2][0], node[Ycj8DEZMQFuX7Gn2sSCUXw],
[P], s[STARTED]: Failed to execute [index 
{[eses][attachment][file:/C:/indexdir/Lüneburg/somefile],
source[{"uri" : "C:\\indexdir\\L�neburg\\somefile","allow_token_document" :
"__nosecurity__","deny_token_document" : "__nosecurity__","allow_token_share" : 
"__nosecurity__","deny_token_share" :
"__nosecurity__","type" : "attachment","_name" : "collection.pickle","file" : 
"KGRwMQp.....

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [uri]
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:395)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:599)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:467)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:506)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:326)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
at [Source: [B@56c77e95; line: 1, column: 254]
at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1378)
at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3008)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3002)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2165)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2092)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:275)
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:85)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:107)
at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:286)
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:384)
... 11 more


In this case the offending character is the German umlaut 'ü', but since
ElasticSearchIndex#jsonStringEscape() does little more than escape
backslashes, I assume the problem affects a wider range of non-ASCII and
special characters.
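
To illustrate the kind of escaping I have in mind, here is a minimal sketch. It
is not the current ManifoldCF code: the class JsonEscapeSketch and its methods
escape() and toUtf8() are hypothetical names made up for this example. It
escapes quotes, backslashes and control characters, and writes every non-ASCII
character as a backslash-u escape, so the resulting JSON text is plain ASCII
and no longer depends on the charset used to send it. The byte 0xfc in the
error is what 'ü' becomes in ISO-8859-1 or Windows-1252, so the request body
may have been encoded with the platform default charset. That is an assumption
on my part; if so, encoding the body explicitly as UTF-8 would be the other
half of a fix.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class JsonEscapeSketch {
    private JsonEscapeSketch() {}

    // Escapes a Java string for use inside a JSON string literal
    // (the surrounding double quotes are not added).
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20 || c > 0x7e) {
                        // Remaining control characters and everything outside
                        // printable ASCII become a four-digit hex escape.
                        // Surrogate pairs are emitted as two such escapes,
                        // which is valid JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Encodes the JSON text explicitly as UTF-8 instead of relying on the
    // platform default charset.
    static byte[] toUtf8(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }
}
```

With this sketch, the path from the log above would be written with doubled
backslashes and the 'ü' as a six-character ASCII escape sequence, which the
Jackson parser inside Elasticsearch should accept regardless of the transport
encoding.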
