I made all the changes, but there is no improvement. I think the data is being indexed properly, because I can see the results in Luke, which has options for viewing them in both UTF-8 encoding and the default string encoding. I tried both and saw no difference: in both cases the regional text displays correctly. It does not display correctly in the browser, though. How do I handle decoding when fetching the search results through the searcher?
Thanks,
KK

On Thu, May 21, 2009 at 1:05 PM, KK <dioxide.softw...@gmail.com> wrote:

> Thanks @Uwe.
> #To answer your last mail's query, textOnly is the output of the method
> downloadPage(), the complete text including all HTML tags etc.
> #Instead of doing the encode/decode later, what I should do when
> downloading the page through the buffered reader is set the charset to
> UTF-8, as you mentioned in your last mail. So instead of
>
> BufferedReader reader =
>     new BufferedReader(new InputStreamReader(
>         pageUrl.openStream()));
>
> I should do this,
>
> BufferedReader reader =
>     new BufferedReader(new InputStreamReader(
>         pageUrl.openStream(), <mention the charset like
>         Charset.forName("UTF-8")>));
>
> right? And remove this conversion that I'm doing later,
>
> byte [] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> This will make sure I'm not depending on the platform encoding, right? This
> seems to fix my indexing issue. Now, regarding searching, I don't need to
> mention any charset there, I'm using the standard analyzer? As far as I know,
> Lucene stores the chars as raw Unicode, so when I present my query in the
> same Unicode format Lucene will give me proper results. Currently I'm not
> using the encoding for HTTP parameters; I'll use that and let you know.
> Thank you very much.
>
> KK
>
> On Thu, May 21, 2009 at 12:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> I forgot:
>>
>> > byte [] utfEncodeByteArray = textOnly.getBytes();
>> > String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>> >
>> > here textonly is the text extracted from the downloaded page
>>
>> What is textonly here? A String? If yes, why decode and then again encode
>> it? The important thing is:
>> Strings in Java are always invariant to charsets (internally they are
>> UTF-16). So if you convert a byte array to a String, you have to specify a
>> charset (as you have done in the new String code). If you convert a String
>> to a byte array, you must do the same.
>>
>> As mentioned in the mail before, the same is true when converting
>> InputStreams to Readers and Writers to OutputStreams (this can be done
>> using the converter).
>>
>> And: if you get a String from somewhere that looks bad, you cannot convert
>> the String to another encoding; it was already corrupted during the earlier
>> conversion to a String.
>>
>> E.g. in a web application, use ServletRequest.setCharacterEncoding() to
>> specify the input encoding of the HTTP parameters and so on.
>>
>> Uwe
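
For reference, here is a minimal sketch of the rule from the quoted thread: name the charset at every byte/char boundary, both when reading and when writing. The class and method names (CharsetDemo, readAll, writeAll) are made up for illustration and are not from the original code. It uses in-memory streams in place of pageUrl.openStream() so that it runs on its own.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
public class CharsetDemo {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Bytes -> String: decode with an explicit charset, never the platform default.
    static String readAll(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, UTF8));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // String -> bytes: encode with the same explicit charset.
    static void writeAll(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, UTF8);
        writer.write(text);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // A short Devanagari sample, written with escapes so the source file's
        // own encoding does not matter.
        String original = "\u0939\u093f\u0928\u094d\u0926\u0940";

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeAll(original, sink);
        byte[] bytes = sink.toByteArray();

        // Same charset on both sides: the text survives the round trip.
        String decoded = readAll(new ByteArrayInputStream(bytes));
        System.out.println("UTF-8 round trip intact: " + original.equals(decoded));

        // Wrong charset on the decoding side: the String is already corrupted
        // and looks like garbage when displayed.
        String wrong = new String(bytes, Charset.forName("ISO-8859-1"));
        System.out.println("ISO-8859-1 decode intact: " + original.equals(wrong));
    }
}
```

The same applies on the output side of a search page. Assuming the results are served from a servlet, the response writer also needs an explicit charset, for example through ServletResponse.setCharacterEncoding("UTF-8") or a Content-Type of "text/html; charset=UTF-8". Otherwise text that is correct in the index can still reach the browser garbled.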