Re: problems with search on Russian content

Karl Oie Fri, 22 Nov 2002 00:39:14 -0800

Hi, I took a look at Andrey Grishin's Russian character problem and found something strange while we tried to debug it. He seems to have avoided the usual "querying with a different encoding than the one used for indexing" problem, since he can dump correctly encoded Russian at every point in his application.

Are the strings for terms treated differently from the text stored in text fields? I ask because his Russian words are correct in the stored text fields but show up faulty in a terms() dump. If he had a character encoding problem in his application, I think the fields would show up faulty as well. Even stranger, I use Lucene 1.2 successfully with utf-8, iso-8859-1, iso-8859-5 and iso-8859-7. Why does this problem show up with Russian (Cp1251) and not with the other encodings?

Strangeness number two: suppose the Russian word ",!,_,U" was skewed to, say, "0d66539qw" upon indexing, and the problem was just a consistent encoding problem. Wouldn't a query for ",!,_,U" be skewed to "0d66539qw" as well, and be found anyway?
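
The reasoning in this paragraph can be checked with a small sketch: if the same deterministic skew is applied at index time and at query time, the two skewed strings are equal, so an exact term match would still succeed. The skew used here (UTF-8 bytes decoded as ISO-8859-1), the class name, and the sample word are assumptions chosen for illustration, not what Andrey's application does.

```java
import java.nio.charset.StandardCharsets;

class ConsistentSkew {
    // Hypothetical "skew": encode as UTF-8, then wrongly decode as ISO-8859-1.
    // It is deterministic, so the same input always gives the same output.
    static String skew(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Sample Russian word written with Unicode escapes (an assumption,
        // the word from the original mail did not survive archiving).
        String word = "\u043f\u0440\u0438\u0432\u0435\u0442";

        String indexed = skew(word); // what would end up in the index
        String queried = skew(word); // what the query would look for

        System.out.println(indexed.equals(queried)); // skewed forms match each other
        System.out.println(indexed.equals(word));    // but differ from the original
    }
}
```

If both sides really were skewed the same way, the search would find the document even though the term looks wrong in a dump, which is why a consistent encoding problem alone does not explain the missing hits.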

mvh karl oie

Begin forwarded message:

From: "Andrey Grishin" <[EMAIL PROTECTED]>
Date: Thu Nov 21, 2002  15:13:33 Europe/Oslo
To: "Karl Oie" <[EMAIL PROTECTED]>
Subject: Re: How to include strange characters??


yes, you are right - there are no russian words in returned terms :(((
I've just executed the following
--------------
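IndexReader r =
    IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
TermEnum e = r.terms();
while (e.next()) {
  Term term = e.term();  // term() already returns a Term, no cast needed
  System.out.println("term : " + term.text());
}
e.close();  // release the enumeration and the reader when done
r.close();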
--------------
and got no russian words in result
there are some "strange" terms returned instead of russian:
term : 0d4xvp70w
term : 0d66539qw
term : 0d67les2o
term : 0d6eqgic0
etc.....
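
A side check that is not part of the thread, offered only as a sketch: the strings in this dump have the shape of base-36 numbers, so one way to test whether they could be skewed Russian text at all is to parse one as a radix-36 long and read the result as milliseconds since the epoch. If the value lands on a plausible date, the term more likely comes from a date-like field than from the content field. The class and method names are hypothetical; only the term string is taken from the dump above.

```java
import java.time.Instant;

class TermShapeCheck {
    // Parse a term as a base-36 number; returns -1 if it is not one.
    static long asBase36(String term) {
        try {
            return Long.parseLong(term, Character.MAX_RADIX); // MAX_RADIX is 36
        } catch (NumberFormatException ex) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        long millis = asBase36("0d66539qw");
        System.out.println(millis);
        // Interpret the number as epoch milliseconds (an assumption to be tested).
        System.out.println(Instant.ofEpochMilli(millis));
    }
}
```

For "0d66539qw" this gives 1032166175000, which read as epoch milliseconds is 2002-09-16T08:49:35Z. That would be consistent with these terms belonging to a date field rather than to the Russian content, so printing term.field() next to term.text() in the loop above would be a cheap way to confirm which field each term belongs to.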

So, I think we've got a problem. This is great :)), thank you...
but how to fix it?




----- Original Message -----
From: "Karl Oie" <[EMAIL PROTECTED]>
To: "Andrey Grishin" <[EMAIL PROTECTED]>
Sent: Thursday, November 21, 2002 3:56 PM
Subject: Re: How to include strange characters??


another thing to check is whether IndexReader.terms() actually
contains your term.

mvh karl oie

On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:

Karl,
I have the same problem with lucene search within russian content.
I tried all your advice, but lucene still can't find anything :((((
I indexed the content using Cp1251 charset
------------
text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD,text));
------------
and I am searching using the same charset
String txt = ",!,_,U";
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

and lucene can't find anything.
Also I checked for the DecodeInterceptor in my server.xml - there
isn't any
I tried UTF-8/16 and got the same result.
If I list all of the index's content by iterating over IndexReader, I can see
that my russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else can
be done here to fix this problem?

I will appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS
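
A note on the `new String(text.getBytes("Cp1251"))` idiom used above, as a minimal sketch under stated assumptions: a Java String is already Unicode, so encoding it to bytes and decoding it again either changes nothing (when both steps use the same charset and that charset can represent the text) or loses information. In the snippets above the decode step uses the platform default charset, so the result depends on the machine. The helper name, the sample word, and the charsets below are hypothetical and were picked only because every JVM is guaranteed to support them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encode a String to bytes with one charset, decode it with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Sample Russian word written with Unicode escapes (an assumption).
        String word = "\u043f\u043e\u0438\u0441\u043a";

        // Same charset both ways, and it can represent Cyrillic: a no-op.
        String same = roundTrip(word, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(word));

        // A charset that cannot represent Cyrillic: every letter is replaced
        // by '?', and the original text cannot be recovered.
        String lossy = roundTrip(word, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        System.out.println(lossy);
    }
}
```

Under these assumptions the round trip cannot repair text; at best it leaves it unchanged. A charset normally belongs at the point where bytes are read from a file or request, not on a String that is already decoded.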


