[jira] [Updated] (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-12-27 Thread Koji Sekiguchi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2346:
-

Attachment: SOLR-2346.patch

New patch attached. I updated it for the current trunk and changed the getCharsetFromContentType() 
method to remove unnecessary strings after the charset value.
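
For illustration, here is a minimal sketch of this kind of parsing: take a Content-Type value and return only the charset token, ignoring any parameters or text that follow it. The class and method names below are made up for the example; this is not the patch code.

import java.util.Locale;

class ContentTypeCharsetSketch {
    // Hypothetical sketch (not the SOLR-2346 patch): extract the charset token from a
    // Content-Type value and ignore anything after it, e.g.
    // "text/plain; charset=big5; format=flowed" -> "big5".
    static String extractCharset(String contentType) {
        if (contentType == null) return null;
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (idx < 0) return null;
        String rest = contentType.substring(idx + "charset=".length()).trim();
        int end = 0;
        while (end < rest.length()
                && rest.charAt(end) != ';'
                && !Character.isWhitespace(rest.charAt(end))) {
            end++;
        }
        String value = rest.substring(0, end);
        // Strip optional surrounding quotes, e.g. charset="UTF-8".
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
    }
}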

I think this is ready to go.

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1, 4.0
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.6, 4.0

 Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, 
 UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt


 I am able to successfully index/search non-English files (like Hebrew and 
 Japanese) that are encoded in UTF-8. However, when I tried to index data 
 that was encoded in a local encoding like Big5 for Japanese, I could not see 
 the desired results: the content of the Big5-encoded document looked garbled 
 when I searched for all indexed documents. When I index the attached 
 non-UTF-8 file, it is indexed as follows:
 - <result name="response" numFound="1" start="0">
 - <doc>
 - <arr name="attr_content">
     <str>�� ��</str>
   </arr>
 - <arr name="attr_content_encoding">
     <str>Big5</str>
   </arr>
 - <arr name="attr_content_language">
     <str>zh</str>
   </arr>
 - <arr name="attr_language">
     <str>zh</str>
   </arr>
 - <arr name="attr_stream_size">
     <str>17</str>
   </arr>
 - <arr name="content_type">
     <str>text/plain</str>
   </arr>
   <str name="id">doc2</str>
   </doc>
   </result>
   </response>
 Here you said it indexes the file in UTF-8; however, it seems that the 
 non-UTF-8 file gets indexed in Big5 encoding.
 Here I tried fetching the indexed data as Big5 and converting it to UTF-8:
 String id = (String) resulDocument.getFirstValue("attr_content");
 byte[] bytearray = id.getBytes("Big5");
 String utf8String = new String(bytearray, "UTF-8");
 It does not give the expected results.
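 (Such a round trip generally cannot recover the text: getBytes("Big5") re-encodes a 
 String that was already decoded with the wrong charset, so the original bytes are 
 gone. The charset has to be applied when the raw file bytes are first decoded. A 
 minimal sketch of that direction, assuming the attached file is read straight from 
 disk:)

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

class DecodeBig5File {
    public static void main(String[] args) throws Exception {
        // Hypothetical illustration: read the raw bytes of the Big5-encoded
        // attachment and decode them with the Big5 charset. Re-encoding a String
        // that was already decoded with the wrong charset cannot recover the text.
        byte[] raw = Files.readAllBytes(Paths.get("sample_jap_non_UTF-8.txt"));
        String text = new String(raw, Charset.forName("Big5"));
        System.out.println(text);
    }
}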
 When I index the UTF-8 file, it is indexed as follows:
 - <doc>
 - <arr name="attr_content">
     <str>マイ ネットワーク</str>
   </arr>
 - <arr name="attr_content_encoding">
     <str>UTF-8</str>
   </arr>
 - <arr name="attr_stream_content_type">
     <str>text/plain</str>
   </arr>
 - <arr name="attr_stream_name">
     <str>sample_jap_unicode.txt</str>
   </arr>
 - <arr name="attr_stream_size">
     <str>28</str>
   </arr>
 - <arr name="attr_stream_source_info">
     <str>myfile</str>
   </arr>
 - <arr name="content_type">
     <str>text/plain</str>
   </arr>
   <str name="id">doc2</str>
   </doc>
 So, I can index and search UTF-8 data.
 For reference, below is the discussion with Yonik.
 Please find attached the TXT file which I was using to index and search.
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8"
  -F myfile=@sample_jap_non_UTF-8
 One problem is that you are giving Big5-encoded text to Solr and saying that 
 it's UTF-8.
 Here's one way to actually tell Solr what the encoding of the text you are 
 sending is:
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true"
  --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
 Now the problem appears to be that, for some reason, this doesn't work...
 Could you open a JIRA issue and attach your two test files?
 -Yonik
 http://lucidimagination.com

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-12-27 Thread Koji Sekiguchi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2346:
-

Attachment: SOLR-2346.patch

bq. getCharsetFromContentType() method to remove unnecessary strings after the 
charset value.

My fault. This is not necessary. I should add the --data-binary option to curl instead.
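
For comparison, here is a rough client-side sketch (an illustration only, not part of the patch) of what --data-binary with an explicit Content-Type header amounts to: the raw file bytes are streamed unchanged and their charset is declared in the header, rather than being wrapped in a multipart form part as -F does.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class PostRawWithCharset {
    public static void main(String[] args) throws Exception {
        // Hypothetical sketch mirroring the curl --data-binary command quoted in
        // this issue: POST the raw bytes of the Big5 file and declare the charset.
        URL url = new URL("http://localhost:8983/solr/update/extract"
                + "?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/plain; charset=big5");
        InputStream in = new FileInputStream("sample_jap_non_UTF-8.txt");
        OutputStream out = conn.getOutputStream();
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // raw bytes, no client-side re-encoding
            }
        } finally {
            out.close();
            in.close();
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}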

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1, 4.0
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.6, 4.0

 Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, 
 SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, 
 sample_jap_non_UTF-8.txt






[jira] Updated: (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-03-08 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2346:
-

Affects Version/s: 4.0
   3.1
Fix Version/s: 4.0
   3.2

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1, 4.0
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Assignee: Koji Sekiguchi
Priority: Critical
 Fix For: 3.2, 4.0

 Attachments: NormalSave.msg, SOLR-2346.patch, UnicodeSave.msg, 
 sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt






[jira] Updated: (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-02-04 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Attachment: UnicodeSave.msg
NormalSave.msg

I hope the following issue is the same one.

Attached above are the Hebrew *.msg files which I have tried to index using the 
following command.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true"
 -F myfile=@NormalSave.msg 

The file UnicodeSave.msg was saved as "Outlook Message Format - Unicode" and 
NormalSave.msg was saved as "Outlook Message Format".
When I search with *:* in Solr, NormalSave.msg gives junk characters and 
UnicodeSave.msg gives an empty attr_content.

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Priority: Critical
 Attachments: NormalSave.msg, UnicodeSave.msg, sample_jap_UTF-8.txt, 
 sample_jap_non_UTF-8.txt






[jira] Updated: (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-02-03 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP 
SP1, Machine was booted in Japanese Locale.  (was: Solr 1.4.1, Packaged Jetty 
as servlet container, Windows XP SP1, Machine is booted in Japanese Locale.)

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine was booted in Japanese Locale.
Reporter: Prasad Deshpande
Priority: Critical
 Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt






[jira] Updated: (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-02-02 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Attachment: sample_jap_non_UTF-8.txt
sample_jap_UTF-8.txt

I have verified the use case using the attached files.

 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
 XP SP1, Machine is booted in Japanese Locale.
Reporter: Prasad Deshpande
Priority: Critical
 Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt






[jira] Updated: (SOLR-2346) Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not getting indexed correctly.

2011-02-02 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Description: 
I am able to successfully index/search non-English files (like Hebrew and 
Japanese) that are encoded in UTF-8. However, when I tried to index data that 
was encoded in a local encoding like Big5 for Japanese, I could not see the 
desired results: the content of the Big5-encoded document looked garbled when I 
searched for all indexed documents. When I index the attached non-UTF-8 file, 
it is indexed as follows:

- <result name="response" numFound="1" start="0">
- <doc>
- <arr name="attr_content">
    <str>�� ��</str>
  </arr>
- <arr name="attr_content_encoding">
    <str>Big5</str>
  </arr>
- <arr name="attr_content_language">
    <str>zh</str>
  </arr>
- <arr name="attr_language">
    <str>zh</str>
  </arr>
- <arr name="attr_stream_size">
    <str>17</str>
  </arr>
- <arr name="content_type">
    <str>text/plain</str>
  </arr>
  <str name="id">doc2</str>
  </doc>
  </result>
  </response>

Here you said it indexes the file in UTF-8; however, it seems that the 
non-UTF-8 file gets indexed in Big5 encoding.
Here I tried fetching the indexed data as Big5 and converting it to UTF-8:

String id = (String) resulDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");
It does not give the expected results.

When I index the UTF-8 file, it is indexed as follows:

- <doc>
- <arr name="attr_content">
    <str>マイ ネットワーク</str>
  </arr>
- <arr name="attr_content_encoding">
    <str>UTF-8</str>
  </arr>
- <arr name="attr_stream_content_type">
    <str>text/plain</str>
  </arr>
- <arr name="attr_stream_name">
    <str>sample_jap_unicode.txt</str>
  </arr>
- <arr name="attr_stream_size">
    <str>28</str>
  </arr>
- <arr name="attr_stream_source_info">
    <str>myfile</str>
  </arr>
- <arr name="content_type">
    <str>text/plain</str>
  </arr>
  <str name="id">doc2</str>
  </doc>

So, I can index and search UTF-8 data.


For reference, below is the discussion with Yonik.
Please find attached the TXT file which I was using to index and search.

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8"
 -F myfile=@sample_jap_non_UTF-8


One problem is that you are giving Big5-encoded text to Solr and saying that 
it's UTF-8.
Here's one way to actually tell Solr what the encoding of the text you are 
sending is:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true"
 --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'

Now the problem appears to be that, for some reason, this doesn't work...
Could you open a JIRA issue and attach your two test files?

-Yonik
http://lucidimagination.com








 Non-UTF-8 text files containing non-English text (Japanese/Hebrew) are not 
 getting indexed correctly.
 ---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell