How do I index an HTML document that may be in any encoding, such as EUC, SJIS, Western European, or UTF-8? Can I parse the HTML, extract its text into a string, and then convert that into a Unicode text file? Is this an appropriate way to index HTML files? Can anyone suggest a simple parser other than the one in the Lucene demo?
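To make the question concrete, here is a rough sketch of the approach I have in mind, in Java since Lucene is Java. It assumes the encoding of each file is already known and passed in by name. The class and method names (HtmlToTextSketch, decode, stripTags, toUtf8) are made up for illustration, and the regex-based tag stripping is only a crude stand-in for a real HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene.
public class HtmlToTextSketch {

    // Decode raw file bytes into a Java String using a known charset name.
    // A Java String is Unicode (UTF-16) internally, so after this step the
    // original encoding no longer matters.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Crude tag stripping: drop script/style blocks, then all remaining tags,
    // then collapse runs of whitespace. Entities such as &amp; are NOT decoded.
    public static String stripTags(String html) {
        String noBlocks = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noBlocks.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Re-encode the extracted text as UTF-8 bytes, e.g. to write a text file.
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For a Japanese page I would call something like decode(bytes, "Shift_JIS") or decode(bytes, "EUC-JP"), assuming the JRE in use ships those charsets, and then pass the result of stripTags to the indexer as an ordinary String.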
Also, how do I find the encoding of a file? Whenever ANSI text files contain Japanese characters, I am not able to convert them into UTF-16. Lucene indexes them properly when I convert them into SJIS.

Thanks,
Chandan