[ 
https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-589:
-------------------------------

    Attachment: LUCENE-589.patch

Attached is a patch; it also fixes LUCENE-2246.

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-589.patch
>
>
> JavaCC assumes ASCII, so it won't work with, say, Japanese documents. Ideally
> it would read the charset from the HTML markup, but that can be tricky. For
> now, assuming Unicode would do the trick.
> Add the following line, marked with a +, to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }
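
The description mentions reading the charset from the HTML markup. As a rough
illustration only, here is a minimal Java sketch of that idea. CharsetSniffer,
sniff and decode are hypothetical names, not part of the Lucene demo or of the
attached patch, and the UTF-8 fallback is an assumption. The sketch only
covers ASCII-compatible encodings and looks no further than the first 1024
bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses the encoding of an HTML document from a
// "charset=..." declaration near the top of the file.
final class CharsetSniffer {
    // Matches charset=NAME, charset="NAME" or charset='NAME', case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 (an assumed default) if none is
    // declared or the declared name is unknown to this JVM.
    static Charset sniff(byte[] html) {
        int n = Math.min(html.length, 1024);
        // ISO-8859-1 maps every byte to one char, so an ASCII declaration
        // survives this provisional decoding whatever the real encoding is
        // (as long as that encoding is ASCII-compatible; UTF-16 is not handled).
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the whole document with the sniffed charset.
    static String decode(byte[] html) {
        return new String(html, sniff(html));
    }
}
```

The decoded text could then be handed to the parser through a
java.io.StringReader. As far as I understand it, UNICODE_INPUT only governs
how the generated token manager treats the characters it is given, so the
bytes still have to be decoded with the right charset first.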

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


