[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719126#action_12719126 ]

Robert Muir commented on LUCENE-1488:
-------------------------------------

Michael, I don't think it will be ready for 2.9; here are some answers to your 
questions.

Going with your Arabic example:
The only thing this absorbs is language-specific tokenization (like 
ArabicLetterTokenizer), because, as mentioned, I think that's generally the 
wrong approach.
But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems 
Arabic text in a language-specific way, which has a huge effect on retrieval 
quality for Arabic-language text.

Some of what it does, though, the language-specific analyzers don't do.

In this specific example, it would be nice if ArabicAnalyzer really used the 
functionality here, then did its Arabic-specific stuff! Because this 
functionality will do things like normalize 'Arabic Presentation Forms' and 
deal with Arabic digits and other things that aren't in the ArabicAnalyzer. 
It will also treat any non-Arabic text in your corpus very nicely!
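To make the presentation-forms point concrete, here is a minimal sketch, using 
only the JDK's java.text.Normalizer (not Lucene or ICU classes), of how NFKC 
compatibility normalization folds an isolated presentation-form ligature back 
to its base letters:

```java
import java.text.Normalizer;

public class PresentationFormsDemo {
    public static void main(String[] args) {
        // U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM.
        String presentation = "\uFEFB";
        // NFKC folds the compatibility presentation form back to the
        // base letters LAM (U+0644) + ALEF (U+0627).
        String normalized =
            Normalizer.normalize(presentation, Normalizer.Form.NFKC);
        System.out.println(normalized.equals("\u0644\u0627")); // prints "true"
    }
}
```

In an analyzer this kind of normalization would run as a filter step before 
stemming, so the stemmer only ever sees the base forms.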

Yes, you are correct about the difference from StandardAnalyzer, and I would 
argue there are tokenization bugs in how StandardAnalyzer handles European 
languages too; just see LUCENE-1545!

I know StandardAnalyzer does these things. This tokenizer has some built-in 
types already, such as number. If you want to add more types, it's easy: just 
make a .txt file with your grammar, create a RuleBasedBreakIterator from it, 
and pass it to the tokenizer constructor. You will have to subclass the 
tokenizer's getType() for any new types, though, because RBBI 'types' are 
really just integer codes in the rule file, and you have to map them to some 
text such as "WORD".
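To illustrate the iteration model, here is a sketch using the JDK's 
java.text.BreakIterator rather than the patch's actual classes; ICU's 
RuleBasedBreakIterator follows the same model but additionally accepts a rule 
string in its constructor and exposes getRuleStatus() for those integer type 
codes:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Walk the break positions and collect the "word" segments.
    static List<String> tokenize(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            String token = text.substring(start, end);
            // The iterator also yields whitespace and punctuation
            // segments; keep only tokens containing a letter or digit.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo bar 42, baz"));
        // prints "[foo, bar, 42, baz]"
    }
}
```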

Yes, case-folding will work better than lowercasing for a few European 
languages.
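As a quick illustration of why plain lowercasing falls short (a sketch using 
only JDK string methods; ICU's case folding handles these cases uniformly and 
locale-independently):

```java
import java.util.Locale;

public class CaseFoldingDemo {
    public static void main(String[] args) {
        // Lowercasing is locale-sensitive: in Turkish, 'I' maps to
        // dotless 'ı' (U+0131), not 'i' as Unicode case folding would.
        System.out.println("I".toLowerCase(new Locale("tr")));

        // One-to-many mapping: German 'ß' uppercases to "SS", so
        // round-tripping through case changes can alter string length.
        System.out.println("Wei\u00DF".toUpperCase(Locale.ROOT)); // WEISS
    }
}
```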


> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts, because it is unaware of the Unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until I looked at the JFlex rules and saw that the codepoint range 
> for most of the Thai block was added to the alphanum specification. Defining 
> the exact codepoint ranges like this for every language could help with the 
> problem, but you'd basically be reimplementing the bounds properties already 
> stated in the Unicode standard. 
> In general this kind of behavior is bad in Lucene even for Latin; for 
> instance, the analyzer will break words around accent marks in decomposed 
> form. While most Latin letter + accent combinations have composed forms in 
> Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I 
> suppose.) 
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based 
> BreakIterator instead of JFlex. Using this method you can define word 
> boundaries according to the Unicode bounds properties. After getting it into 
> some good shape I'd be happy to contribute it to contrib, but I wonder if 
> there's a better solution so that out of the box Lucene will be more 
> friendly to non-ASCII text. Unfortunately it seems JFlex does not support 
> the use of these properties, such as [\p{Word_Break = Extend}], so this is 
> probably the major barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

