Robert,

Yes, this will not work, as some digits are used to represent (transliterate, if I may say) some Arabic letters in Latin script (e.g. 3 for Arabic Aeen, and 7 for Arabic H'a).
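
To make the problem concrete, here is a minimal sketch in plain Java. It only imitates a letter-based tokenizer and does not call any Lucene API; the class name RomanizedTokenDemo and its methods are hypothetical, made up for this illustration. It shows how a letters-only rule drops the leading digit of a romanized word, while a letters-or-digits rule keeps the word whole:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical demo class: imitates a character-class tokenizer in plain
// Java. It is not Lucene's CharTokenizer and uses no Lucene API.
final class RomanizedTokenDemo {

    // Splits text into maximal runs of code points accepted by isTokenChar.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Letters only: digits act as separators, so "7elo" loses its first letter.
    static List<String> lettersOnly(String text) {
        return tokenize(text, Character::isLetter);
    }

    // Letters or digits: romanized words such as "7elo" stay intact.
    static List<String> lettersAndDigits(String text) {
        return tokenize(text, Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("7elo 3arabi"));      // [elo, arabi]
        System.out.println(lettersAndDigits("7elo 3arabi")); // [7elo, 3arabi]
    }
}
```

The second rule keeps the romanized words, but, as Robert notes below, letting tokens start with or contain digits may not be the best default for higher-quality text.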
Some online services convert such transliteration on the fly (e.g. Yamli at
http://www.yamli.com/ where you can try the word "7elo", which means
nice/cool in Arabic), so we could provide an analyzer stage that converts
such content to Arabic script :)

Basem

On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir <rcm...@gmail.com> wrote:
> Uwe, I might add to what you say. I do disagree a bit and think mixed
> english/arabic text is pretty common (aside from the "product name" issue
> you discussed).
>
> this can get really complex for some informal text: you have maybe some
> english, arabic, and arabic written in informal romanization, sometimes all
> mixed together:
>
> Example:
> http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3
>
> Not really sure how to make the default ArabicAnalyzer to meet everyone's
> needs, in this example its gonna screw up the romanized arabic, because they
> use numerics for some letters, and it uses something based on CharTokenizer
> :) But allowing a word to say, start with or contain a numeric, this might
> not be the best thing for higher-quality text...
>
>
> On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>> I think the idea of lowercase filter in the arabic analyzers is not to
>> really index mixed language texts. It is more for the case, if you have
>> some word between the Arabic content (like product names,.), which happens
>> often. You see this often also in Japanese texts. And for these embedded
>> English fragments you really need no stop word list. And if there is a
>> stop word in it, for the target language it is not a real stop word, it
>> may be additional information. Stop word removal is done mostly because of
>> they are needless (appear in every text). But if you have one Arabic
>> sentence where "the" also appears next to an English word, it is more
>> important than all the "the" in this mail.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>
> --
> Robert Muir
> rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org