Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.
------------------------------------------------------------------------------------------------------
                 Key: SOLR-2346
                 URL: https://issues.apache.org/jira/browse/SOLR-2346
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 1.4.1
         Environment: Solr 1.4.1, packaged Jetty as servlet container, Windows XP SP1; machine is booted in Japanese locale.
            Reporter: Prasad Deshpande
            Priority: Critical

I am able to successfully index and search non-English files (such as Hebrew and Japanese) that are encoded in UTF-8. However, when I tried to index data encoded in a local encoding, such as Big5 for Japanese, I did not get the desired results. When I searched for all indexed documents, the content of the Big5-encoded document looked garbled.

When I index the attached non-UTF-8 file, it is indexed as follows:

    <result name="response" numFound="1" start="0">
      <doc>
        <arr name="attr_content">
          <str>�� ������</str>
        </arr>
        <arr name="attr_content_encoding">
          <str>Big5</str>
        </arr>
        <arr name="attr_content_language">
          <str>zh</str>
        </arr>
        <arr name="attr_language">
          <str>zh</str>
        </arr>
        <arr name="attr_stream_size">
          <str>17</str>
        </arr>
        <arr name="content_type">
          <str>text/plain</str>
        </arr>
        <str name="id">doc2</str>
      </doc>
    </result>
    </response>

You said that the file is indexed as UTF-8; however, it seems that the non-UTF-8 file gets indexed in Big5 encoding. I tried fetching the indexed data as Big5 and converting it to UTF-8:

    String id = (String) resulDocument.getFirstValue("attr_content");
    byte[] bytearray = id.getBytes("Big5");
    String utf8String = new String(bytearray, "UTF-8");

This does not give the expected results.
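The conversion attempted above can be illustrated in isolation. The following is a minimal sketch, not Solr or Tika code: the class name `CharsetRoundTrip`, its helper methods, and the sample text are all hypothetical. It suggests why re-encoding an already garbled `String` is unlikely to recover the text, while decoding the raw bytes with the charset they were written in does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Solr or Tika.
class CharsetRoundTrip {

    // Decode raw file bytes with the charset they were actually written in.
    static String decode(byte[] raw, Charset actual) {
        return new String(raw, actual);
    }

    // The kind of conversion tried in the report: re-encode the String
    // with one charset, then reinterpret those bytes with another.
    static String reinterpret(String s, Charset from, Charset to) {
        return new String(s.getBytes(from), to);
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");
        String original = "\u4e2d\u6587";      // assumed sample text, not from the report
        byte[] raw = original.getBytes(big5);  // what a Big5 file would hold on disk

        // Decoding Big5 bytes as UTF-8 yields replacement characters (U+FFFD);
        // the original byte values are lost at this point.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        String right = decode(raw, big5);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        // Re-encoding the garbled String cannot bring the lost bytes back.
        System.out.println(
            reinterpret(wrong, big5, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```

If the stored field value already contains replacement characters, the fix probably has to happen before or during extraction, by decoding the original stream with the correct charset, rather than after retrieval.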
When I index a UTF-8 file, it is indexed as follows:

    <doc>
      <arr name="attr_content">
        <str>マイ ネットワーク</str>
      </arr>
      <arr name="attr_content_encoding">
        <str>UTF-8</str>
      </arr>
      <arr name="attr_stream_content_type">
        <str>text/plain</str>
      </arr>
      <arr name="attr_stream_name">
        <str>sample_jap_unicode.txt</str>
      </arr>
      <arr name="attr_stream_size">
        <str>28</str>
      </arr>
      <arr name="attr_stream_source_info">
        <str>myfile</str>
      </arr>
      <arr name="content_type">
        <str>text/plain</str>
      </arr>
      <str name="id">doc2</str>
    </doc>

So, I can index and search UTF-8 data.
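As a sanity check on the two reported `attr_stream_size` values (17 and 28), the sketch below computes the encoded length of the sample text under two encodings. It rests on assumptions the report does not confirm: that both files contain the same text, that the UTF-8 file starts with a 3-byte byte order mark, and that Shift_JIS stands in as an illustrative double-byte Japanese encoding. The class name `StreamSizeCheck` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Solr or Tika.
class StreamSizeCheck {

    // Length of the UTF-8 byte order mark (EF BB BF).
    static final int UTF8_BOM_LENGTH = 3;

    // Number of bytes the text occupies when encoded with the given charset.
    static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // The sample text from the UTF-8 result, written with escapes so the
        // source file's own encoding does not matter.
        String text = "\u30de\u30a4 \u30cd\u30c3\u30c8\u30ef\u30fc\u30af";

        // Assumption: the UTF-8 file carries a BOM.
        int utf8WithBom = encodedLength(text, StandardCharsets.UTF_8) + UTF8_BOM_LENGTH;
        // Assumption: Shift_JIS as an example double-byte encoding.
        int shiftJis = encodedLength(text, Charset.forName("Shift_JIS"));

        System.out.println(utf8WithBom); // 28
        System.out.println(shiftJis);    // 17
    }
}
```

Under these assumptions the numbers match the reported stream sizes, which is consistent with the 17-byte file holding the same text in a double-byte encoding. It does not by itself show which encoding the file really uses.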