This issue is probably due to my noobishness with ELK, Python, and Unicode. I have an index containing logstash-digested logs, including a field 'req_host', which contains a host name. Using elasticsearch-py, I'm pulling that host name out of the record and using it to search in another index. However, if the hostname contains multibyte characters, the search fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with 'curl -XGET'. The Unicode character in question is a lowercase 'a' with a diaeresis (two dots); its UTF-8 encoding is C3 A4, and its code point seems to be U+00E4 (the language is Swedish).
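(Just to double-check that byte/code-point relationship, here's what the same Python 2 interpreter the script below runs under reports:)

>>> print repr(u'\u00e4')                  # the code point: u'\xe4'
>>> print repr(u'\u00e4'.encode('utf-8'))  # the UTF-8 bytes: '\xc3\xa4'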
These curl commands work just fine from the command line:

curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d '
{ "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'

curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d '
{ "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'

They find and return the record. (The second command shows how the hostname appears in the log I pull it from, with the lowercase 'a' with a diaeresis in two places.)

I've written a very short Python script to show the problem. It uses hardwired queries, printing each one and its type, then trying to use it in a search:

---- start code ----
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import elasticsearch

es = elasticsearch.Elasticsearch()

if __name__ == "__main__":
    #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}'            # raw utf-8 characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}'  # quoted unicode characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}'  # quoted utf-8 characters. does not work
    uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}'                       # non-unicode. works fine
    print "uq", type(uq), uq
    result = es.search(index="logstash-2015.01.30", doc_type="logs", timeout=1000, body=uq)
    if result["hits"]["total"] == 0:
        print "nothing found"
    else:
        print "found some"
---- end code ----

If I run it as shown, with the 'facebook' query, it's fine - the output is:

$ python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some
$

Note that the query string 'uq' is unicode. But if I use any of the other three strings, which include the Unicode characters, I get:

$ python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
  File "testutf8b.py", line 15, in <module>
    result = es.search(index="logstash-2015.01.30", doc_type="logs", timeout=1000, body=uq)
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))

This is under CentOS 7, using ES 1.5.0. The logs were digested into ES under a slightly older version, using logstash-1.4.2.

Any ideas? The ES documentation has sections about codecs, but those are about analysis. This looks to me like an elasticsearch-py library issue (or I'm doing something stupid).

thanks!
PT