[jira] [Commented] (LUCENE-4587) WordBreakSpellChecker treats bytes as chars

James Dyer (JIRA) Thu, 06 Dec 2012 12:51:12 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512102#comment-13512102
 ]


James Dyer commented on LUCENE-4587:
------------------------------------

Steven,

Thank you for taking the time to review this.  You're right that the regex is 
better, but the 4 chars are probably ok since this mimics what MockTokenizer 
will split on.

Working on this made me wonder whether WordBreakSpellChecker itself could be 
made more useful for non-western languages if it were configurable to 
break/combine on characters other than the space.  I have very little 
linguistic background, so I'm not sure whether there is a solid use-case for 
this or how it would work.  My guess is that it would be too complicated for 
now, if it is useful at all.  But if anyone has thoughts in this direction, I 
wouldn't mind hearing them.
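
To make the idea concrete, here is a rough sketch of what breaking on a 
configurable separator might look like.  It is only an illustration: the 
class BreakSketch and the method breakCandidates are hypothetical names, not 
part of the patch or of WordBreakSpellChecker's API.  It steps through the 
term by code point rather than by byte or char, so a surrogate pair is never 
cut in half.

```java
import java.util.ArrayList;
import java.util.List;

public class BreakSketch {

    // Hypothetical helper, not part of Lucene: returns every way to split
    // `term` into two non-empty parts, joined by the given `separator`.
    static List<String> breakCandidates(String term, String separator) {
        List<String> out = new ArrayList<>();
        if (term.isEmpty()) {
            return out; // nothing to split
        }
        // Advance one code point at a time so a surrogate pair stays intact.
        for (int i = term.offsetByCodePoints(0, 1);
             i < term.length();
             i = term.offsetByCodePoints(i, 1)) {
            out.add(term.substring(0, i) + separator + term.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Break on a space, as the spell checker does today.
        System.out.println(breakCandidates("abc", " "));
        // Break on a different separator (assumption: a zero-width space).
        System.out.println(breakCandidates("abc", "\u200B").size());
    }
}
```

Whether a separator other than the space is linguistically meaningful for a 
given language is exactly the open question above; the sketch only shows that 
the mechanics would be simple.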
                
> WordBreakSpellChecker treats bytes as chars
> -------------------------------------------
>
>                 Key: LUCENE-4587
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4587
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/spellchecker
>    Affects Versions: 4.0
>            Reporter: Andreas Hubold
>            Assignee: James Dyer
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4587.patch
>
>
> Originally opened as SOLR-4115.
