[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasad Deshpande updated SOLR-2346:
-----------------------------------

    Attachment: sample_jap_non_UTF-8.txt
                sample_jap_UTF-8.txt

I have verified the use case using the attached files.

> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2346
>                 URL: https://issues.apache.org/jira/browse/SOLR-2346
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1
>         Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine is booted in Japanese Locale.
>            Reporter: Prasad Deshpande
>            Priority: Critical
>         Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
>
> I am able to successfully index and search non-English files (such as Hebrew
> and Japanese) that are encoded in UTF-8. However, when I tried to index data
> encoded in a local encoding, like Big5 for Japanese, I did not get the
> desired results. When I searched for all indexed documents, the contents of
> the Big5-encoded document looked garbled. When I index the attached
> non-UTF-8 file, it is indexed as follows:
> <result name="response" numFound="1" start="0">
>   <doc>
>     <arr name="attr_content">
>       <str>�� ������</str>
>     </arr>
>     <arr name="attr_content_encoding">
>       <str>Big5</str>
>     </arr>
>     <arr name="attr_content_language">
>       <str>zh</str>
>     </arr>
>     <arr name="attr_language">
>       <str>zh</str>
>     </arr>
>     <arr name="attr_stream_size">
>       <str>17</str>
>     </arr>
>     <arr name="content_type">
>       <str>text/plain</str>
>     </arr>
>     <str name="id">doc2</str>
>   </doc>
> </result>
> </response>
> Here you said it indexes the file in UTF-8; however, it seems that the
> non-UTF-8 file gets indexed in Big5 encoding.
> Here I tried fetching the indexed data stream in Big5 and converting it to UTF-8:
> String id = (String) resulDocument.getFirstValue("attr_content");
> byte[] bytearray = id.getBytes("Big5");
> String utf8String = new String(bytearray, "UTF-8");
> It does not give the expected results.
> When I index the UTF-8 file, it is indexed as follows:
> <doc>
>   <arr name="attr_content">
>     <str>マイ ネットワーク</str>
>   </arr>
>   <arr name="attr_content_encoding">
>     <str>UTF-8</str>
>   </arr>
>   <arr name="attr_stream_content_type">
>     <str>text/plain</str>
>   </arr>
>   <arr name="attr_stream_name">
>     <str>sample_jap_unicode.txt</str>
>   </arr>
>   <arr name="attr_stream_size">
>     <str>28</str>
>   </arr>
>   <arr name="attr_stream_source_info">
>     <str>myfile</str>
>   </arr>
>   <arr name="content_type">
>     <str>text/plain</str>
>   </arr>
>   <str name="id">doc2</str>
> </doc>
> So, I can index and search UTF-8 data.
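
A note on the conversion attempt quoted above: `String.getBytes(...)` encodes the characters a Java string already holds, so it cannot recover text that was decoded with the wrong charset earlier in the pipeline. The sketch below illustrates the difference on the sample text. It assumes the raw bytes are still available and that the real source encoding is Shift_JIS. That encoding is an assumption made only for illustration; the issue does not state how the attachment was encoded, although a 17-byte file is consistent with eight double-byte characters plus one space. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Decode raw bytes using the charset they were actually written in.
    static String decode(byte[] raw, Charset source) {
        return new String(raw, source);
    }

    // Transcode raw bytes from a known source charset to UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        return decode(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The sample text from the report, written with Unicode escapes so the
        // source file compiles the same way under any platform encoding.
        String original = "\u30DE\u30A4 \u30CD\u30C3\u30C8\u30EF\u30FC\u30AF";

        Charset sjis = Charset.forName("Shift_JIS"); // assumed source encoding
        Charset big5 = Charset.forName("Big5");      // encoding reported in the index

        byte[] raw = original.getBytes(sjis);        // stands in for the file contents

        String decodedCorrectly = decode(raw, sjis);
        String decodedAsBig5 = decode(raw, big5);

        System.out.println("raw bytes:         " + raw.length);
        System.out.println("correct charset:   " + decodedCorrectly.equals(original));
        System.out.println("decoded as Big5:   " + decodedAsBig5.equals(original));
        System.out.println("UTF-8 bytes:       " + toUtf8(raw, sjis).length);
    }
}
```

If the charset is guessed wrongly at extraction time, the fix has to happen before or during decoding of the original bytes, for example by supplying the correct charset when the file is read, rather than by re-encoding the stored string afterwards.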