[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785054#action_12785054 ]

Robert Muir commented on LUCENE-1488:
-------------------------------------

DM, I really appreciate your review. You have brought up some good ideas that I 
haven't yet thought about.

bq. All I see is a bit of JavaDoc and an extraneous unused variable 
(ICUTokenizer: private PositionIncrementAttribute posIncAtt).

Yeah, there are some TODOs, and cleanup needed on the tokenstreams and the API 
in general. It's not yet easy to customize the way it's supposed to be: where 
you as a user can actually supply BreakIterator impls to the tokenizer and say 
"use these rules/dictionary/whatever for tokenizing XYZ script only".
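
Roughly the shape I have in mind is something like the following (totally 
made-up names, nothing like this exists in the patch yet; it's only to 
illustrate "plug in your own BreakIterator per script"):

{code:java}
// Hypothetical sketch only: these names are invented for illustration
// and are not part of the patch.
import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;

public interface PerScriptBreakIteratorConfig {
  /** Return the BreakIterator to use for runs of the given UScript code. */
  BreakIterator getBreakIterator(int script);
}

class ExampleConfig implements PerScriptBreakIteratorConfig {
  public BreakIterator getBreakIterator(int script) {
    if (script == UScript.THAI) {
      // e.g. substitute a dictionary- or rule-based iterator for Thai only
      return BreakIterator.getWordInstance(new java.util.Locale("th"));
    }
    return BreakIterator.getWordInstance(); // UAX#29 default for the rest
  }
}
{code}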

bq. I'm wondering whether it would make sense to have multiple representations 
of a token with the same position in the index. Specifically, transliterations 
and case-folding. That is, the one is a "synonym" for the other. Is that 
possible and does it make sense? I'm imagining a use case where an end user 
enters for a search request a Latin script transliteration of Greek "uios" but 
might also enter "υιος".

Yeah, this is something to consider. I don't think it makes sense for the case 
folding filter, but maybe for the transform filter? Will have to think about it.
There are use cases here like the one you mentioned, and also real-world ones 
like invoking Serbian-Latin or something, where you want users to search in 
either writing system and there actually is a clearly defined transformation.

I guess on the other hand, you could always use a separate field (with 
different analysis/transforms) for each and search both.
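
If it does end up making sense for the transform filter, a one-token-lookahead 
filter that stacks the transliterated form at the same position would be the 
obvious shape. A minimal sketch (against the newer attribute-based TokenStream 
API, with a made-up transliterate() helper standing in for an ICU 
Transliterator):

{code:java}
// Sketch only: emits a transliterated "synonym" at position increment 0.
// transliterate() is a placeholder, e.g. for ICU's Greek-Latin transform.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class TranslitSynonymFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State pending; // token we still owe a synonym for
  private String synonym;

  public TranslitSynonymFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (pending != null) {
      restoreState(pending); // keep the original token's offsets/type
      pending = null;
      termAtt.setEmpty().append(synonym);
      posIncAtt.setPositionIncrement(0); // stack on the same position
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String translit = transliterate(termAtt.toString());
    if (translit != null && !translit.equals(termAtt.toString())) {
      pending = captureState();
      synonym = translit;
    }
    return true;
  }

  private String transliterate(String s) {
    return null; // placeholder
  }
}
{code}

That way a query for "uios" and a query for "υιος" would both hit the same 
position.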

bq. The other question on my mind is that given a text of German, Greek and 
Hebrew (three distinct scripts) does it make sense to apply stop words to them 
based on script? And should stop words be normalized on load with the 
ICUNormalizationFilter? Or is it a given that they work as is?

You could put them all in one list with the regular StopFilter now. They won't 
clash since they are different Unicode Strings. Obviously I would normalize 
this list with the same stuff (normalization form/case folding/whatever) that 
your analyzer uses.
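
e.g., something like this (a sketch against a current ICU4J; Normalizer2 is 
newer than the 2.9-era patch, so substitute whatever normalizer your chain 
actually applies):

{code:java}
// Sketch: one stop set covering several scripts, normalized the same way
// the analyzer normalizes tokens (here NFKC + case folding via ICU4J).
import java.util.HashSet;
import java.util.Set;
import com.ibm.icu.text.Normalizer2;

public class StopSetBuilder {
  public static Set<String> normalizedStopSet(String[] words) {
    Normalizer2 norm = Normalizer2.getNFKCCasefoldInstance();
    Set<String> stopSet = new HashSet<String>();
    for (String w : words) {
      stopSet.add(norm.normalize(w)); // German, Greek, Hebrew all coexist
    }
    return stopSet;
  }
}
{code}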

I don't put any stopwords in this, because that's language-dependent; I'm 
trying to stick with language-independent stuff (either things that apply to 
Unicode as a whole, or to specific writing systems, which can be accurately 
detected).

bq. Can/How does all this integrate with stemmers?

Right, this is just supposed to be what "StandardTokenizer"-type stuff does, 
and you would add stemming on top of it. The idea is you would use this even if 
you think you only have English text, maybe then applying your Porter English 
stemmer. But if it happens to stumble upon some CJK or Thai or something along 
the way, everything will be OK.
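
So the layering would look roughly like this (sketched against the current 
module layout, with ICUTokenizer standing in for the patch's tokenizer):

{code:java}
// Sketch: language-independent tokenization first, language-specific
// stemming as an optional layer on top.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

public class EnglishOnTopAnalyzer extends Analyzer {
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new ICUTokenizer();           // sane for any script
    TokenStream sink = new PorterStemFilter(source); // English-specific bonus
    return new TokenStreamComponents(source, sink);
  }
}
{code}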

In all honesty, I probably put 90% of the work into the Khmer, Myanmar, Lao, 
etc. cases. Having good tokenization is, I think, what makes a usable search 
engine; for a lot of languages stemming is only a bonus.

However, one thing it also does is put the script value in the flags for each 
token. This can work pretty well: if it's Greek script, it's probably Greek 
language, but if it's Hebrew script, well, it could be Yiddish too. If it's 
Latin script, it could be English, German, etc. It's intended only to make life 
easier since the information is already available... but I don't know yet how 
to make use of it in a nice way.
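
Consuming it is trivial, at least; a sketch (assuming the flags hold a UScript 
constant, as above):

{code:java}
// Sketch: read the script code the tokenizer stored in the token flags.
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
import com.ibm.icu.lang.UScript;

public class ScriptPeek {
  public static void dump(TokenStream ts) throws IOException {
    FlagsAttribute flagsAtt = ts.addAttribute(FlagsAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // a hint only: "Greek" is probably Greek, "Latin" could be anything
      System.out.println(UScript.getName(flagsAtt.getFlags()));
    }
    ts.end();
    ts.close();
  }
}
{code}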

bq. Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 
1.4!)

Yeah, I haven't updated it to Java 5/Lucene 3.x yet; I started working on it, 
but kinda forgot about that so far. I guess this is a good thing, so you can 
play with it if you want.



> multilingual analyzer based on icu
> ----------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, 
> LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts. This is because it is unaware of Unicode boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until I looked at the jflex rules and saw that the codepoint range 
> for most of the Thai block was added to the alphanum specification. Defining 
> the exact codepoint ranges like this for every language could help with the 
> problem, but you'd basically be reimplementing the boundary properties 
> already stated in the Unicode standard.
> In general it looks like this kind of behavior is bad in Lucene even for 
> Latin; for instance, the analyzer will break words around accent marks in 
> decomposed form. While most Latin letter + accent combinations have composed 
> forms in Unicode, some do not. (This is also an issue for 
> ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the Unicode boundary properties. After getting it 
> into some good shape I'd be happy to contribute it for contrib, but I wonder 
> if there's a better solution so that out-of-box Lucene will be more friendly 
> to non-ASCII text. Unfortunately it seems jflex does not support use of 
> these properties, such as [\p{Word_Break = Extend}], so this is probably the 
> major barrier.
> Thanks,
> Robert
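>
> A toy of the approach (a made-up demo, not the actual patch; ICU's default 
> word instance already implements the UAX#29 properties that jflex lacks):
>
> {code:java}
> import com.ibm.icu.text.BreakIterator;
>
> public class RbbiDemo {
>   public static void main(String[] args) {
>     BreakIterator bi = BreakIterator.getWordInstance();
>     String text = "ab\u0301c \u0e44\u0e17\u0e22"; // decomposed Latin + Thai
>     bi.setText(text);
>     int start = bi.first();
>     for (int end = bi.next(); end != BreakIterator.DONE;
>          start = end, end = bi.next()) {
>       String token = text.substring(start, end);
>       if (token.trim().length() > 0) {
>         // the combining mark stays attached to its word
>         System.out.println("[" + token + "]");
>       }
>     }
>   }
> }
> {code}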
