Hi,
Thanks for the tip, but that didn't work in my case. Presumably this patch, together with the changes in CVS, makes the parser work with UTF-16, but I can't really tell: the index now appears to be entirely UTF-16 and I can't search for anything.
My input is actually UTF-8 anyway, and if I patch all the streams to use UTF-8 instead of UTF-16, I get parser errors.
So I'm stuck.
Thanks for your help,
Fred
At 09:46 PM 9/24/2004, [EMAIL PROTECTED] wrote:
In org.apache.lucene.demo.HTMLDocument you need to change the input stream to use a different encoding. Replace the fis with this:
fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
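As a rough sketch of what that constructor does with different charset names, the snippet below writes a small UTF-8 file and decodes it twice, once as "UTF-8" and once as "UTF-16". The class EncodingSketch and the helper readAll are hypothetical names made up for illustration; they are not part of the Lucene demo, and the sketch says nothing about how the demo's HTML parser itself handles the resulting characters.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for illustration only; not part of the Lucene demo.
class EncodingSketch {

    // Decode the whole file using the named charset, e.g. "UTF-8" or "UTF-16".
    static String readAll(File f, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charsetName))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small HTML fragment to a temporary file as UTF-8 bytes.
        File f = File.createTempFile("sample", ".html");
        f.deleteOnExit();
        String html = "<title>Caf\u00e9</title>";
        Files.write(f.toPath(), html.getBytes(StandardCharsets.UTF_8));

        // Compare the original text with the text decoded under each charset.
        System.out.println("UTF-8 matches:  " + html.equals(readAll(f, "UTF-8")));
        System.out.println("UTF-16 matches: " + html.equals(readAll(f, "UTF-16")));
    }
}
```

The charset name passed to InputStreamReader has to match the bytes actually on disk. Decoding UTF-8 input as UTF-16 does not raise an error here; it silently produces different characters, which would be consistent with an index that can no longer be searched.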
-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?
Sorry, that didn't cure it.
Again, does anyone want to point me to the quickest replacement HTML parser (one that's Unicode-clean)?
Thanks,
Fred
At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
>Please try again with the code from CVS, Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
> Daniel
>
>--
>http://www.danielnaber.de
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
