[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655840#action_12655840 ]
Robert Muir commented on LUCENE-1488:
-------------------------------------

That's a good idea. Currently, trying to get it to pass all the StandardAnalyzer unit tests causes some problems, since Lucene has some rather obscure definitions of 'number' (I think IP addresses, etc. are included) that differ dramatically from the basic Unicode definition.

Other things of note: instantiating the analyzer takes a long time (a couple of seconds) because ICU must "compile" the rules. I'm not sure of the specifics, but by "compile" I think that means building a massive FSM or similar based on all the Unicode data. It's possible to precompile the rules into a binary format, but I don't think this is currently exposed in ICU.

The Lucene tokenization pipeline makes the implementation a little hairy. I work around it by tokenizing on whitespace first, then acting as a token filter (just like the Thai analyzer does, which also uses RBBI). I don't think this is really that bad from a linguistic standpoint, because the rare cases where a 'token' can have whitespace inside it (Persian, etc.) need serious muscle somewhere else and should be handled by a language-specific analyzer.

I'll try to get this thing into reasonable shape, at least to document the approach.

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>
> The StandardAnalyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of the Unicode boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block had been added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the boundary properties already stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses ICU's rule-based BreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode boundary properties. After getting it into good shape I'd be happy to contribute it to contrib, but I wonder if there's a better solution so that, out of the box, Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert
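For illustration only, here is a minimal, untested sketch of the whitespace-then-break-iterator idea described above, using ICU4J's stock UAX #29 word iterator. The class and method names are hypothetical and not taken from the actual patch; the patch compiles its own rules via new RuleBasedBreakIterator(rules) (the slow "compile" step mentioned above) and wraps this logic in a Lucene TokenFilter, the way the Thai analyzer does.

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: re-splits a single whitespace-delimited token into
 * words using ICU's default word BreakIterator (UAX #29 boundary rules).
 * The stock instance ships with precompiled data, so obtaining it is cheap;
 * a custom new RuleBasedBreakIterator(rules) would be compiled at runtime.
 */
public class UnicodeWordSplitter {

    // BreakIterator is stateful and not thread-safe, so build it once per
    // thread and reuse it across tokens.
    private final BreakIterator wordBreaker =
        BreakIterator.getWordInstance(ULocale.ROOT);

    public List<String> split(String whitespaceToken) {
        List<String> words = new ArrayList<String>();
        wordBreaker.setText(whitespaceToken);
        int start = wordBreaker.first();
        for (int end = wordBreaker.next();
             end != BreakIterator.DONE;
             start = end, end = wordBreaker.next()) {
            String candidate = whitespaceToken.substring(start, end);
            // UAX #29 also reports boundaries around punctuation; keep only
            // pieces that contain at least one letter or digit.
            if (hasLetterOrDigit(candidate)) {
                words.add(candidate);
            }
        }
        return words;
    }

    private static boolean hasLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            int cp = s.codePointAt(i);
            if (UCharacter.isLetter(cp) || UCharacter.isDigit(cp)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UnicodeWordSplitter splitter = new UnicodeWordSplitter();
        // Thai has no spaces between words; the word iterator still splits
        // it, unlike an analyzer that only knows alphanumeric codepoints.
        System.out.println(splitter.split("สวัสดีครับ"));
        // A decomposed accent (e + U+0301) stays attached to its base letter,
        // since combining marks carry the Word_Break=Extend property.
        System.out.println(splitter.split("re\u0301sume\u0301"));
    }
}

In a real analyzer the same splitting would run per whitespace token inside a TokenFilter, with offsets adjusted relative to the original input; that part is omitted here because the exact TokenStream API differs between Lucene versions.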