[ 
https://issues.apache.org/jira/browse/LUCENE-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701706#comment-13701706
 ] 

Robert Muir commented on LUCENE-5096:
-------------------------------------

{quote}
The whitespace tokenizer supports only Java whitespace as defined in 
http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)
{quote}

Not exactly: it uses Character.isWhitespace(int), the code point overload, not isWhitespace(char).
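
To make the distinction concrete, here is a minimal sketch of splitting text by walking code points and testing each one with Character.isWhitespace(int). This is not Lucene's tokenizer code; the class name CodePointSplit and the split helper are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene's implementation.
class CodePointSplit {

    // Splits text into tokens wherever Character.isWhitespace(int) is true.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Read a full code point, so surrogate pairs are handled as one unit.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // Whitespace ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("foo bar\tbaz"));
    }
}
```

Under this sketch, a string such as "a\u00A0b" stays a single token, because U+00A0 is not Java whitespace.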

{quote}
A useful improvement would be to support also Unicode whitespace as defined in 
the Unicode property list 
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
{quote}

There are only four code points listed there that are not Java whitespace:
U+0085 (NEXT LINE)
U+00A0 (NO-BREAK SPACE)
U+2007 (FIGURE SPACE)
U+202F (NARROW NO-BREAK SPACE)

Breaking on the last three would violate the intent of isWhitespace, whose 
Javadoc explicitly excludes them ("but is not also a non-breaking space 
('\u00A0', '\u2007', '\u202F')").
U+0085 is not a whitespace character either: despite having this strange 
White_Space property, its general category is Cc (control).
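
The claims above can be checked with a short sketch that queries the JDK for each of the four code points. The class name UnicodeVsJavaWhitespace and the describe helper are hypothetical; only the java.lang.Character methods are real JDK API.

```java
// Hypothetical illustration: inspects the four code points discussed above.
class UnicodeVsJavaWhitespace {

    // Code points with the Unicode White_Space property that are not Java whitespace.
    static final int[] CANDIDATES = {0x0085, 0x00A0, 0x2007, 0x202F};

    // Reports how java.lang.Character classifies a single code point.
    static String describe(int cp) {
        return String.format("U+%04X javaWhitespace=%b spaceChar=%b isoControl=%b",
                cp,
                Character.isWhitespace(cp),   // what the comment says the tokenizer uses
                Character.isSpaceChar(cp),    // true for Unicode space/line/paragraph separators
                Character.isISOControl(cp));  // true for ISO control characters
    }

    public static void main(String[] args) {
        for (int cp : CANDIDATES) {
            System.out.println(describe(cp));
        }
    }
}
```

On a standard JDK, none of the four should report javaWhitespace=true. The three no-break spaces should report spaceChar=true, and U+0085 should report isoControl=true.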

I personally think we are doing the right thing...


                
> WhitespaceTokenizer supports Java whitespace, should also support Unicode 
> whitespace
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5096
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5096
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.3.1
>         Environment: all
>            Reporter: Jörg Prante
>            Priority: Minor
>
> The whitespace tokenizer supports only Java whitespace as defined in 
> http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)
> A useful improvement would be to support also Unicode whitespace as defined 
> in the Unicode property list 
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
