[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2346:
---------------------------------
    Attachment: SOLR-2346.patch

New patch attached. I updated it for the current trunk and changed the getCharsetFromContentType() method to remove unnecessary strings after the charset value. I think this is ready to go.

Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
---

                Key: SOLR-2346
                URL: https://issues.apache.org/jira/browse/SOLR-2346
            Project: Solr
         Issue Type: Bug
         Components: contrib - Solr Cell (Tika extraction)
   Affects Versions: 1.4.1, 3.1, 4.0
        Environment: Solr 1.4.1, packaged Jetty as servlet container, Windows XP SP1; the machine was booted in a Japanese locale.
           Reporter: Prasad Deshpande
           Assignee: Koji Sekiguchi
           Priority: Critical
            Fix For: 3.6, 4.0
        Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt

I am able to successfully index/search non-English files (like Hebrew and Japanese) that were encoded in UTF-8. However, when I tried to index data encoded in a local encoding such as Big5, I did not see the desired results: when I searched for all indexed documents, the contents of the Big5-encoded document looked garbled.

When I index the attached non-UTF-8 file, it is indexed as follows:

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="attr_content"><str>�� ��</str></arr>
    <arr name="attr_content_encoding"><str>Big5</str></arr>
    <arr name="attr_content_language"><str>zh</str></arr>
    <arr name="attr_language"><str>zh</str></arr>
    <arr name="attr_stream_size"><str>17</str></arr>
    <arr name="content_type"><str>text/plain</str></arr>
    <str name="id">doc2</str>
  </doc>
</result>

You said the file is indexed as UTF-8; however, it seems the non-UTF-8 file gets indexed in the Big5 encoding. Here I tried fetching the indexed data stream as Big5 and converting it to UTF-8:
String id = (String) resultDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");

It does not give the expected results. When I index the UTF-8 file, it is indexed as follows:

<doc>
  <arr name="attr_content"><str>マイ ネットワーク</str></arr>
  <arr name="attr_content_encoding"><str>UTF-8</str></arr>
  <arr name="attr_stream_content_type"><str>text/plain</str></arr>
  <arr name="attr_stream_name"><str>sample_jap_unicode.txt</str></arr>
  <arr name="attr_stream_size"><str>28</str></arr>
  <arr name="attr_stream_source_info"><str>myfile</str></arr>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc2</str>
</doc>

So I can index and search UTF-8 data. For more reference, below is the discussion with Yonik. Please find attached the TXT files I was using to index and search.

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F myfile=@sample_jap_non_UTF-8

One problem is that you are giving Big5-encoded text to Solr and saying that it's UTF-8. Here's one way to actually tell Solr what the encoding of the text you are sending is:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'

Now the problem appears that, for some reason, this doesn't work... Could you open a JIRA issue and attach your two test files?

-Yonik
http://lucidimagination.com

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
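As an aside, the round-trip attempted in the reporter's snippet above cannot recover the text: by the time attr_content is a Java String, the bytes have already been decoded with whatever charset was (mis)declared, and characters lost to U+FFFD replacement characters are unrecoverable. A minimal, self-contained sketch of the difference (the sample text and charset names here are illustrative, not taken from the attached files):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");

        // Pretend these are the raw bytes of an uploaded file: "中文" encoded in Big5.
        byte[] rawBig5 = "中文".getBytes(big5);

        // Wrong: decode Big5 bytes as UTF-8 (what happens when the charset is
        // mis-declared). Invalid byte sequences become U+FFFD replacement chars.
        String garbled = new String(rawBig5, StandardCharsets.UTF_8);
        System.out.println(garbled.contains("\uFFFD")); // true

        // Re-encoding the already-garbled String, as in the snippet above,
        // cannot restore the original text:
        String roundTrip = new String(garbled.getBytes(big5), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals("中文")); // false

        // Right: decode the ORIGINAL bytes with the charset they were written in.
        String correct = new String(rawBig5, big5);
        System.out.println(correct.equals("中文")); // true
    }
}
```

This is why the fix has to happen at extraction time (telling Solr/Tika the correct charset of the incoming stream) rather than after the content is indexed.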
[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2346:
---------------------------------
    Attachment: SOLR-2346.patch

bq. getCharsetFromContentType() method to remove unnecessary strings after the charset value.

My fault. This is not necessary. I should have added the --data-binary option to curl.
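For readers following along, the getCharsetFromContentType() change discussed above concerns pulling the charset parameter out of a Content-Type header such as "text/plain; charset=big5". The actual code is in the attached SOLR-2346.patch; as a purely illustrative sketch (not Solr's real implementation), a helper that also strips "unnecessary strings after the charset value" might look like:

```java
import java.util.Locale;

public class ContentTypeUtil {
    /**
     * Extracts the charset parameter from a Content-Type header value,
     * e.g. "text/plain; charset=big5" -> "big5".
     * Returns null if no charset parameter is present.
     * Illustrative only; not the implementation in the SOLR-2346 patch.
     */
    public static String getCharsetFromContentType(String contentType) {
        if (contentType == null) return null;
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (idx < 0) return null;
        String charset = contentType.substring(idx + "charset=".length());
        // Drop anything after the charset value, e.g. a trailing "; boundary=..."
        int end = charset.indexOf(';');
        if (end >= 0) charset = charset.substring(0, end);
        return charset.trim();
    }

    public static void main(String[] args) {
        System.out.println(getCharsetFromContentType("text/plain; charset=big5"));           // big5
        System.out.println(getCharsetFromContentType("text/plain; charset=UTF-8; foo=bar")); // UTF-8
        System.out.println(getCharsetFromContentType("text/plain"));                         // null
    }
}
```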
[jira] Updated: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2346:
---------------------------------
    Affects Version/s: 4.0
                       3.1
        Fix Version/s: 4.0
                       3.2
[jira] Updated: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Deshpande updated SOLR-2346:
-----------------------------------
    Attachment: UnicodeSave.msg
                NormalSave.msg

I hope the following issue is the same. Above are the Hebrew *.msg files, which I tried to index using the following command:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" -F myfile=@NormalSave.msg

UnicodeSave.msg was saved as "Outlook Message Format - Unicode" and NormalSave.msg was saved as "Outlook Message Format". When I search with *:* in Solr, NormalSave.msg yields junk characters in attr_content, and UnicodeSave.msg yields an empty attr_content.
[jira] Updated: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Deshpande updated SOLR-2346:
-----------------------------------
    Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale. (was: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine is booted in Japanese Locale.)
[jira] Updated: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Deshpande updated SOLR-2346:
-----------------------------------
    Attachment: sample_jap_non_UTF-8.txt
                sample_jap_UTF-8.txt

I have verified the use case using the attached files.
[jira] Updated: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Deshpande updated SOLR-2346:
-----------------------------------
    Description: updated to include the two curl commands and the discussion with Yonik (full text as quoted in the description above)