[ https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir reassigned LUCENE-589:
----------------------------------

    Assignee: Robert Muir

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>
> JavaCC assumes ASCII input, so the parser won't work with, say, Japanese
> documents. Ideally it would read the charset from the HTML markup, but that
> can be tricky. For now, assuming Unicode would do the trick.
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> + UNICODE_INPUT = true;
> }
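
The issue notes that the parser would ideally read the charset from the HTML markup. As a rough illustration of that idea, and not the fix proposed in the issue, the sketch below scans the first bytes of a document for a `charset=` declaration inside a `<meta>` tag and falls back to a default when none is found. The class name `CharsetSniffer` and its methods are hypothetical and are not part of the Lucene demo; the approach only works for ASCII-compatible encodings and ignores byte-order marks and HTTP headers.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: guesses the charset of an HTML
// document from its <meta> declaration.
class CharsetSniffer {
    // Matches charset=NAME inside a <meta> tag, for example
    //   <meta charset="utf-8">
    //   <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or the fallback if nothing usable is found.
    static Charset detect(byte[] html, Charset fallback) {
        // Only the first 1024 bytes are inspected (an arbitrary limit chosen
        // for this sketch). ISO-8859-1 maps every byte to one char, so the
        // ASCII markup survives whatever the real encoding is.
        int n = Math.min(html.length, 1024);
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Wraps the bytes in a Reader that decodes with the detected charset,
    // defaulting to UTF-8.
    static Reader open(byte[] html) {
        Charset cs = detect(html, StandardCharsets.UTF_8);
        return new InputStreamReader(new ByteArrayInputStream(html), cs);
    }
}
```

The `Reader` returned by `open` could then be handed to the generated parser, assuming it exposes a `Reader`-based constructor. Decoding bytes to characters this way complements `UNICODE_INPUT = true` rather than replacing it: the option lets the generated token manager accept characters outside the ASCII range, while the `Reader` determines how the bytes become characters in the first place.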