I forgot:

> byte[] utfEncodeByteArray = textOnly.getBytes();
> String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
>
> here textonly is the text extracted from the downloaded page

What is textonly here? If it is a String, why encode it to bytes and then decode it again?

The important thing is: Strings in Java are independent of any charset (internally they are always UTF-16). So if you convert a byte array to a String, you have to specify the charset (as you have done in the new String(...) call). If you convert a String to a byte array, you must do the same; getBytes() without an argument uses the platform default charset. As mentioned in the previous mail, the same is true when converting InputStreams to Readers and Writers to OutputStreams (this is done with InputStreamReader and OutputStreamWriter, which take the charset as a constructor argument).

And: if you get a String from somewhere and it already looks wrong, you cannot fix it by converting the String to another encoding; it was corrupted earlier, when the bytes were decoded into a String. For example, in a web application, call ServletRequest.setCharacterEncoding() to specify the input encoding of the HTTP parameters before reading them, and so on.

Uwe
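
A minimal sketch of the points above, in case it helps. The class and method names (CharsetDemo, encodeUtf8, decodeUtf8, readAll, misdecode) are made up for illustration and are not from the original code; the sample text is an assumption too. It shows an explicit charset on both conversions, the same for an InputStream wrapped in a Reader, and how decoding with the wrong charset loses information.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class CharsetDemo {

    // String -> bytes: name the charset instead of relying on the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String: decode with the charset the bytes were actually written in.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // InputStream -> Reader: InputStreamReader does the decoding, so it needs the charset too.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Deliberately wrong: ISO-8859-1 bytes decoded as UTF-8.
    // Invalid byte sequences are replaced by U+FFFD, so the original text is lost.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String textOnly = "Gr\u00fc\u00dfe"; // sample text containing non-ASCII characters
        byte[] utf8 = encodeUtf8(textOnly);

        System.out.println(decodeUtf8(utf8).equals(textOnly));
        System.out.println(readAll(new ByteArrayInputStream(utf8)).equals(textOnly));
        System.out.println(misdecode(textOnly).indexOf('\uFFFD') >= 0);
    }
}
```

Once the replacement character U+FFFD is in the String, no later re-encoding can bring the original characters back, so the charset has to be right at the point where bytes first become a String.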