[ 
https://issues.apache.org/jira/browse/LUCENE-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907186#comment-13907186
 ] 

Steve Rowe commented on LUCENE-5447:
------------------------------------

bq. In the JFlex 1.5.0 release, I added the ability to include external files 
into the rules section of the scanner specification, and I want to take 
advantage of this to refactor StandardTokenizer and UAX29URLEmailTokenizer so 
that there is only one definition of the shared rules. (That would have 
prevented the problem for which I'm reopening this issue.) I'll make a separate 
issue for that.

See LUCENE-5464

> StandardTokenizer should break at consecutive chars matching Word_Break = 
> MidLetter, MidNum and/or MidNumLet
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5447
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.6.1
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>             Fix For: 4.7, 5.0
>
>         Attachments: LUCENE-5447-take2.patch, LUCENE-5447-test.patch, 
> LUCENE-5447.patch, LUCENE-5447.patch
>
>
> StandardTokenizer should split all of the following sequences into two tokens 
> each, but they are all instead kept intact and output as single tokens:
> {noformat}
> "A::B"           (':' is in \p{Word_Break = MidLetter})
> "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
> "A.:B"
> "A:.B"
> "1,,2"           (',' is in \p{Word_Break = MidNum})
> "1,.2"
> "1.,2"
> {noformat}
> Unfortunately, the word break test data released with each Unicode version, 
> e.g. for Unicode 6.3 
> [http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt], doesn't 
> cover these cases, and neither do the versioned Lucene tests generated from 
> that data, e.g. {{WordBreakTestUnicode_6_3_0}}.
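
The expected behavior described above can be illustrated with a minimal, self-contained sketch. This is not Lucene's JFlex grammar and not the patch attached to this issue. It is a hypothetical class (MidCharBreakSketch) that approximates UAX#29 rules WB6/WB7 and WB11/WB12 for three punctuation characters only, assuming ':' is MidLetter, ',' is MidNum and '.' is MidNumLet. It ignores Extend/Format characters, Katakana, ExtendNumLet and the other word-break classes. The point it shows is that a single mid character joins its neighbors only when it sits directly between two letters or two digits, so two consecutive mid characters always force a break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the names here are not part of Lucene.
class MidCharBreakSketch {
    // Simplified subset of the UAX#29 Word_Break property values.
    enum WB { ALETTER, NUMERIC, MID_LETTER, MID_NUM, MID_NUM_LET, OTHER }

    // Assumption: only ':', ',' and '.' are treated as mid characters here.
    static WB classify(char c) {
        if (Character.isLetter(c)) return WB.ALETTER;
        if (Character.isDigit(c)) return WB.NUMERIC;
        switch (c) {
            case ':': return WB.MID_LETTER;
            case ',': return WB.MID_NUM;
            case '.': return WB.MID_NUM_LET;
            default:  return WB.OTHER;
        }
    }

    static boolean isWordChar(WB p) {
        return p == WB.ALETTER || p == WB.NUMERIC;
    }

    // WB6/WB7: ALetter (MidLetter | MidNumLet) ALetter does not break.
    // WB11/WB12: Numeric (MidNum | MidNumLet) Numeric does not break.
    static boolean joins(WB left, WB mid, WB right) {
        if (left == WB.ALETTER && right == WB.ALETTER) {
            return mid == WB.MID_LETTER || mid == WB.MID_NUM_LET;
        }
        if (left == WB.NUMERIC && right == WB.NUMERIC) {
            return mid == WB.MID_NUM || mid == WB.MID_NUM_LET;
        }
        return false;
    }

    // Emits only segments that contain letters or digits; punctuation-only
    // segments are dropped.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            WB p = classify(s.charAt(i));
            if (isWordChar(p)) {
                cur.append(s.charAt(i));
                i++;
            } else if (cur.length() > 0 && i + 1 < s.length()
                    && joins(classify(s.charAt(i - 1)), p, classify(s.charAt(i + 1)))) {
                // Exactly one mid character between two compatible word chars:
                // keep it and the following word char in the current token.
                cur.append(s.charAt(i)).append(s.charAt(i + 1));
                i += 2;
            } else {
                // Anything else (including a second consecutive mid char) breaks.
                if (cur.length() > 0) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                }
                i++;
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = { "A::B", "1..2", "A..B", "A.:B", "A:.B", "1,,2", "1,.2", "1.,2" };
        for (String in : inputs) {
            System.out.println(in + " -> " + tokenize(in));
        }
    }
}
```

Under these assumptions, every input listed in the issue description comes out as two tokens, while single-separator inputs such as "A:B", "A.B", "1,2" and "1.2" stay intact as one token each.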


