[ 
https://issues.apache.org/jira/browse/LUCENE-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe reopened LUCENE-5447:
--------------------------------

    Lucene Fields: New,Patch Available  (was: New)

In looking at the committed diffs (because JIRA was down last night and earlier 
today, the lucene_solr_4_7 commit didn't put a comment on this issue, which 
sucks), I see that I didn't fully patch StandardTokenizerImpl.jflex, although I 
*did* correctly patch UAX29URLEmailTokenizerImpl.jflex, which is basically a 
superset of StandardTokenizerImpl.jflex.

I've added some more tests to demonstrate the problem (the existing tests 
didn't fail); patch forthcoming.  Here's an example that should be split by 
StandardTokenizer but currently isn't - the issue is triggered by a preceding 
char matching {{Word_Break = ExtendNumLet}}, e.g. the underscore character:

{{A:B_A::B}} <- left intact, but should output "{{A:B_A}}", "{{B}}"

By contrast, the current UAX29URLEmailTokenizer gets the above right.
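The intended behavior can be sketched in Python (a simplified stand-in for illustration only, not Lucene's JFlex grammar; the {{MID}} and {{EXTEND_NUM_LET}} sets below are tiny illustrative subsets of the real {{Word_Break}} property values): a single mid character between alphanumerics does not break a token, but two or more consecutive mid characters do.

```python
MID = set(":.,")        # stand-ins for MidLetter, MidNumLet, MidNum
EXTEND_NUM_LET = {"_"}  # stand-in for Word_Break = ExtendNumLet

def tokenize(text):
    """Simplified UAX#29-style segmentation: one MID char flanked by
    alphanumerics joins its neighbors; consecutive MID chars break."""
    tokens, current = [], ""
    for i, c in enumerate(text):
        if c.isalnum() or c in EXTEND_NUM_LET:
            current += c
        elif (c in MID and current
              and i + 1 < len(text) and text[i + 1].isalnum()):
            # single mid char with an alphanumeric on each side: no break
            current += c
        else:
            if current:
                tokens.append(current)
                current = ""
    if current:
        tokens.append(current)
    return tokens

print(tokenize("A:B_A::B"))  # ['A:B_A', 'B'] - the expected split above
```

Under this rule {{A:B}} stays intact while {{A::B}}, {{1,,2}}, etc. are split, matching the behavior this issue asks of StandardTokenizer.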

In the JFlex 1.5.0 release, I added the ability to include external files into 
the rules section of the scanner specification, and I want to take advantage of 
this to refactor StandardTokenizer and UAX29URLEmailTokenizer so that there is 
only one definition of the shared rules.  (That would have prevented the 
problem for which I'm reopening this issue.)  I'll make a separate issue for 
that.

> StandardTokenizer should break at consecutive chars matching Word_Break = 
> MidLetter, MidNum and/or MidNumLet
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5447
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.6.1
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>             Fix For: 4.7, 5.0
>
>         Attachments: LUCENE-5447-test.patch, LUCENE-5447.patch, 
> LUCENE-5447.patch
>
>
> StandardTokenizer should split all of the following sequences into two tokens 
> each, but they are all instead kept intact and output as single tokens:
> {noformat}
> "A::B"           (':' is in \p{Word_Break = MidLetter})
> "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
> "A.:B"
> "A:.B"
> "1,,2"           (',' is in \p{Word_Break = MidNum})
> "1,.2"
> "1.,2"
> {noformat}
> Unfortunately, the word break test data released with Unicode, e.g. for 
> Unicode 6.3 
> [http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt], and 
> incorporated into a versioned Lucene test, e.g. 
> {{WordBreakTestUnicode_6_3_0}}, doesn't cover these cases.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
