Thanks @Uwe. #To answer your last mails query, textOnly is the output of the method downloadPage(), complete text thing includeing all html tags etc... #Instead of doing the encode/decode later, what i should do is when downloading the page through buffered reader put the charset as utf-8 as you mentioned in your last mail. so instead of BufferedReader reader = new BufferedReader(new InputStreamReader( pageUrl.openStream()));
I should do this, BufferedReader reader = new BufferedReader(new InputStreamReader( pageUrl.openStream(), <mention the charset like Charset.forName("UTF-8")>)); right? and remove this conversion that I'm doing later , byte [] utfEncodeByteArray = textOnly.getBytes(); String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- 8")); This will make sure I'm not depending on the platform encoding, right? This seems to fix my indexing issue. Now regarding searching I dont need to mention any charset thing there, I'm using stardard anyalyzer? As I know lucene stores the chars as raw unicode so when I present my query in the same unicode format lucene will give me proper results. Currently I'm not using the encoding for HTTP parameters, I'll use that and let you know. Thank you very much. KK, On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <u...@thetaphi.de> wrote: > I forgot: > > > byte [] utfEncodeByteArray = textOnly.getBytes(); > > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF- > > 8")); > > > > here textonly is the text extracted from the downloaded page > > What is textonly here? A String, if yes, why decode and then again encode > it? The important thing is: > Strings in Java are always invariant to charsets (internally they are > UTF-16). So if you convert a byte array to a string you have to specify a > charset (as you have done in new String code). If you convert a String to a > byte array, you must do the same. > > As mentioned in the mail before, the same is true, when converting > InputStreams to Readers and Writers to OutputStreams (this can be done > using > the converter). > > And: If you get a String from somewhere, that looks bad, you cannot convert > the String to another encoding, it was corrupted during conversion to > string > before. > > E.g. in a WebAppclcation, use ServletRequest.setEncoding() to specify the > input encoding of the HTTP parameters and so on. > > Uwe > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >