[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802568#action_12802568 ]
Vilaythong Southavilay commented on LUCENE-1488:
------------------------------------------------

I am developing an IR system for Lao. I've been searching for this kind of analyzer to use in my development to index documents that contain languages such as Lao, French, and English in a single passage. I tested it for Lao on Lucene 2.9 and 3.0 using a short passage of mine, and it worked correctly for both versions as I expected, especially for segmenting single Lao syllables. I also tried it with the bi-gram filter option for two syllables, which worked fine for simple words, although the result contained some two-syllable words that do not make sense in Lao; I guess this is not a big issue. As Robert pointed out (in an email to me), we still need dictionary-based word segmentation for Lao, which could be integrated into ICU and used by this analyzer. Anyway, thanks for your assistance. This work will be helpful not only for Lao but for other languages as well, because it is good to have a common analyzer for Unicode text. I'll continue testing it and will report any problems I find.

> multilingual analyzer based on icu
> ----------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially for non-alphabetic scripts, because it is unaware of the Unicode word-break properties.
>
> I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block had been added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you would basically be reimplementing the break properties already stated in the Unicode standard.
>
> In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
>
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based BreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode word-break properties. After getting it into good shape I'd be happy to contribute it to contrib, but I wonder whether there is a better solution so that, out of the box, Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties, such as [\p{Word_Break = Extend}], so that is probably the major barrier.
>
> Thanks,
> Robert
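For illustration, here is a minimal sketch of the UAX #29 word-boundary approach described above, using ICU4J's BreakIterator directly rather than the classes in the attached patch. The class name and sample text are hypothetical, and the patch's analyzer presumably does more than this (script handling, token types, and the bi-gram option mentioned in the comment):

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class WordBreakSketch {
        public static void main(String[] args) {
            // Mixed-script sample: English, French in decomposed form
            // (e + U+0301, a + U+0300), and a short run of Lao (U+0E80 block).
            String text = "ICU de\u0301ja\u0300 \u0E9E\u0EB2\u0EAA\u0EB2\u0EA5\u0EB2\u0EA7";

            // UAX #29 word-break iterator; combining marks (Word_Break=Extend)
            // stay attached to the preceding letter instead of splitting the word.
            BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
            words.setText(text);

            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE;
                 start = end, end = words.next()) {
                String token = text.substring(start, end);
                // The raw iterator also reports whitespace/punctuation spans;
                // a real tokenizer would consult the rule status instead of trimming.
                if (!token.trim().isEmpty()) {
                    System.out.println(token);
                }
            }
        }
    }

Depending on the ICU version, the Lao run may come back as a single chunk or as syllable-sized pieces rather than true words; that is the dictionary-based segmentation gap mentioned in the comment above.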