[ https://issues.apache.org/jira/browse/LUCENE-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13512102#comment-13512102 ]
James Dyer commented on LUCENE-4587:
------------------------------------
Steven,
Thank you for taking the time to review this. You're right that the regex is
better, but the 4 chars are probably OK, since this mimics what MockTokenizer
will split on.
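
To make the trade-off concrete, here is a minimal sketch, not the code in the
patch. It assumes the 4 chars are space, tab, carriage return and line feed;
the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from LUCENE-4587.patch.
final class SplitComparison {

    // Splits on an ASSUMED fixed set of four characters: ' ', '\t', '\r', '\n'.
    static List<String> splitOnFourChars(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                // A separator ends the current token, if there is one.
                if (cur.length() > 0) {
                    out.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    // java.util.regex's \s also covers form feed and vertical tab.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits on any run of regex whitespace, dropping empty tokens.
    static List<String> splitOnRegex(String s) {
        List<String> out = new ArrayList<>();
        for (String token : WHITESPACE.split(s)) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

The two agree on input such as "a b\tc\r\nd" and differ only on the rarer
whitespace characters, for example a form feed between two words.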
Working on this made me wonder whether WordBreakSpellChecker itself could be
made more useful for non-Western languages if it were configurable to break
and combine on characters other than the space. I have very little linguistic
background, so I'm not sure whether there is a solid use case for this or how
it would work. My guess is that it would be too complicated for now, if it is
useful at all. But if anyone has thoughts in this direction, I wouldn't mind
hearing them.
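
As a starting point for discussion only, here is a rough sketch of what a
configurable break character might look like. It is not how
WordBreakSpellChecker is implemented; the class, the method and the
"separator" parameter are all hypothetical. It walks the term by code point
rather than by byte or char, so a surrogate pair is never split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Lucene.
final class BreakCandidates {

    // Returns every way to break the term into two parts, joined by the
    // given separator. Break positions fall on code point boundaries.
    static List<String> breakOnce(String term, String separator) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < term.length()) {
            // Advance by one full code point (1 or 2 Java chars).
            i += Character.charCount(term.codePointAt(i));
            if (i < term.length()) {
                out.add(term.substring(0, i) + separator + term.substring(i));
            }
        }
        return out;
    }
}
```

For example, breakOnce("abc", " ") yields "a bc" and "ab c", and passing an
empty string or some other character as the separator would be the
configurable part. Whether that is linguistically meaningful is exactly the
open question.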
> WordBreakSpellChecker treats bytes as chars
> -------------------------------------------
>
> Key: LUCENE-4587
> URL: https://issues.apache.org/jira/browse/LUCENE-4587
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/spellchecker
> Affects Versions: 4.0
> Reporter: Andreas Hubold
> Assignee: James Dyer
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4587.patch
>
>
> Originally opened as SOLR-4115.