[ 
https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721877#action_12721877
 ] 

Steven Rowe commented on LUCENE-1702:
-------------------------------------

bq. I think for this issue it would be best to wait for the 1.5.0 version of 
jflex for clarity.

+0, in that the arrival time for 1.5.0 is unknown, but I'll defer to your 
judgment.

bq. for reference (haven't looked at jflex), above-bmp support might require 
new data structures. I think ICU uses things like tries / compactarrays to deal 
with the fact you have thousands of codepoints with the same property value, 
etc.

Thanks for the heads-up.  The above-BMP property values for the currently 
supported properties are now encoded on the 1.5 branch as range pairs (they 
just aren't accessible yet because of the BMP limit).  Since JFlex is a regular 
expression engine, code for handling large character sets (as sets of ranges) 
is already built-in, so I don't anticipate this will be a problem.  The main 
thing will just be to switch from char to int for character representation.
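
To make the range-pair idea concrete, here is a minimal sketch of a code point set stored as sorted range pairs and queried with int code points. It is not JFlex's actual implementation: the class name CodePointRanges, its methods, and the sample ranges are all hypothetical, chosen only to show why int (rather than char) is needed once values above U+FFFF appear.

```java
import java.util.Arrays;

/** Hypothetical sketch: a set of code points stored as sorted, non-overlapping range pairs. */
public final class CodePointRanges {
    // Flattened pairs: ranges[2*i] is the inclusive start and ranges[2*i + 1]
    // the inclusive end of the i-th range. int holds values above 0xFFFF.
    private final int[] ranges;

    public CodePointRanges(int... sortedPairs) {
        if (sortedPairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected start/end pairs");
        }
        this.ranges = Arrays.copyOf(sortedPairs, sortedPairs.length);
    }

    /** Binary search over the ranges; the cost depends on the number of ranges, not code points. */
    public boolean contains(int codePoint) {
        int lo = 0;
        int hi = ranges.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < ranges[2 * mid]) {
                hi = mid - 1;
            } else if (codePoint > ranges[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    /** Walks the text by code point, so a surrogate pair is tested as one value. */
    public boolean containsAll(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!contains(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With sample ranges such as new CodePointRanges(0x0E50, 0x0E59, 0x1D7CE, 0x1D7FF), a BMP value like 0x0E51 and an above-BMP value like 0x1D7D0 are looked up the same way; only the element type has to be wide enough.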

> Thai token type() bug
> ---------------------
>
>                 Key: LUCENE-1702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1702
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>
> While adding tests for offsets & type to ThaiAnalyzer, I discovered that it 
> does not type Thai numeric digits correctly.
> ThaiAnalyzer uses StandardTokenizer, and this is really an issue with the 
> grammar, which adds the entire [:Thai:] block to ALPHANUM.
> I propose that ALPHANUM be described a little differently in the grammar: 
> instead, [:letter:] should be allowed to have diacritics/signs/combining 
> marks attached to it.
> This would allow the [:thai:] hack to be completely removed, would allow 
> StandardTokenizer to parse complex writing systems such as those of Indian 
> languages, and would fix LUCENE-1545.
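
The proposal in the issue description (a letter followed by any attached combining marks, with digits typed separately) can be roughly illustrated with java.util.regex. This is only a sketch under assumptions: it is not the StandardTokenizer JFlex grammar, the class name LetterWithMarksSketch and the type labels ALPHA and NUM are hypothetical, and the Unicode categories \p{L}, \p{M} and \p{Nd} stand in for whatever character classes the real grammar would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: type tokens by Unicode category instead of by script block. */
public final class LetterWithMarksSketch {
    // "alpha": one or more letters, each optionally followed by combining marks.
    // "num": a run of decimal digits from any script, including Thai digits.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<alpha>(?:\\p{L}\\p{M}*)+)|(?<num>\\p{Nd}+)");

    /** Returns each token as "TYPE:text", in order of appearance. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String type = (m.group("alpha") != null) ? "ALPHA" : "NUM";
            tokens.add(type + ":" + m.group());
        }
        return tokens;
    }
}
```

Here a Thai consonant followed by a vowel sign (for example U+0E01 U+0E34) stays together as one ALPHA token, while Thai digits (U+0E50 to U+0E59) are typed as NUM without any script-specific rule.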
