Basem, yeah, such an analyzer that could somehow do something nice with this transliterated Arabic chat would, I think, be a cool feature for forums and such in the future.
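To make the tokenization problem from the quoted thread concrete, here is a minimal, Lucene-free sketch in plain Java. It assumes a letter-based tokenizer behaves roughly like "split on anything that is not Character.isLetter", which is an approximation and not the actual ArabicAnalyzer code. The class ChatTokenSketch, the tokenize helper, and the sample word "3arabi" are made up for illustration; "7elo" is the example from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not part of Lucene.
class ChatTokenSketch {

    // Split text into maximal runs of code points accepted by isTokenChar,
    // lowercasing as we go. This mimics the general idea of a
    // character-class-based tokenizer, under the assumptions stated above.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String chat = "7elo 3arabi";

        // Letters only: the digits that stand in for Arabic letters are dropped.
        System.out.println(tokenize(chat, Character::isLetter));        // [elo, arabi]

        // Letters or digits: the romanized words survive intact.
        System.out.println(tokenize(chat, Character::isLetterOrDigit)); // [7elo, 3arabi]
    }
}
```

The second variant keeps the chat spellings whole, but as noted in the thread, letting digits into tokens may not be what you want for cleaner text, so this would probably have to be an option rather than the default.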

On Thu, Oct 8, 2009 at 4:33 PM, Basem Narmok <nar...@gmail.com> wrote:
> Robert,
>
> Yes, this will not work, as some digits are used to represent
> (transliterate, if I may say) some Arabic letters (e.g. 3 for Arabic
> Aeen, and 7 for Arabic H'a).
>
> Some online services provide instant translation for such
> transliteration (e.g. http://www.yamli.com/ try the word "7elo", it
> means nice/cool in Arabic), so we may provide an analyzer stage that
> could translate such content to Arabic :)
>
> Basem
>
> On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir <rcm...@gmail.com> wrote:
> > Uwe, I might add to what you say. I do disagree a bit and think mixed
> > English/Arabic text is pretty common (aside from the "product name"
> > issue you discussed).
> >
> > This can get really complex for some informal text: you may have some
> > English, Arabic, and Arabic written in informal romanization, sometimes
> > all mixed together.
> >
> > Example:
> > http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3
> >
> > Not really sure how to make the default ArabicAnalyzer meet everyone's
> > needs. In this example it's gonna screw up the romanized Arabic, because
> > they use numerics for some letters, and it uses something based on
> > CharTokenizer :) But allowing a word to, say, start with or contain a
> > numeric might not be the best thing for higher-quality text...
> >
> > On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> I think the idea of the lowercase filter in the Arabic analyzers is not
> >> really to index mixed-language texts. It is more for the case where you
> >> have some words within the Arabic content (like product names), which
> >> happens often. You often see this in Japanese texts too. And for these
> >> embedded English fragments you really need no stop word list. And if
> >> there is a stop word in it, for the target language it is not a real
> >> stop word; it may be additional information. Stop word removal is done
> >> mostly because stop words are needless (they appear in every text). But
> >> if you have one Arabic sentence where "the" also appears next to an
> >> English word, it is more important than all the "the" in this mail.
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> > --
> > Robert Muir
> > rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com