Persian Implementation

Patrick Estarian Mon, 18 Jul 2011 15:25:22 -0700

Hi,

I am trying to get the Persian part of Lucene to work, but apparently the
current implementation is just a simple stop word tokenizer with no
stemmer, etc. I was trying to find the contact details of the person who
wrote it but couldn't find them anywhere in the code.


In addition, I went through the source and even made some class diagrams out
of the code just to understand the project better. In fact I was looking for
a TokenStream that can give me the previous tokens in a stream but
apparently all the existing classes can only traverse forward and not
backward.

The problem that I am facing with a Persian stemmer is that verbs in
Persian can be made up of multiple words. A simple example of that in English
would be something like the verb "give up" which has a completely different
meaning than "give" or "up":

   We had *give*n the dog *up* as lost.

So, a proper search query should understand this and give us the right
search results. In English it is easier to find such verbs because the main
verb (give) comes first and the second word (up) comes next. But in Persian
it is usually the other way around. Something like:

   We had *up* the dog *given* as lost.

Now when you reach the token "given", you really need to know whether this
verb is a plain verb or part of a complex verb. Therefore, you have to look
back for the token "up" in the stream in order to build the correct verb.

So, please correct me if I am wrong. Given this requirement, my
understanding is that we need a new TokenStream that holds a few of the
previous tokens in a list or an array. If this is correct, please let me
know how I can make such a class and what considerations I should keep in
mind: things like memory consumption, performance, staying true to the
architecture, etc.
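
To make the idea concrete, here is a minimal sketch of such a look-back
buffer in plain Java, with no Lucene classes involved. The class name
LookBackTokens and its methods previous() and seenRecently() are
hypothetical and only illustrate the bounded-history idea; they are not part
of Lucene. In Lucene itself the same buffering would presumably live in a
TokenFilter subclass, and that wiring is left out here.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

/**
 * Hypothetical helper (not a Lucene class): wraps a forward-only token
 * iterator and remembers the last few tokens that were consumed.
 */
final class LookBackTokens implements Iterator<String> {
    private final Iterator<String> input;   // forward-only source of tokens
    private final int capacity;             // maximum number of previous tokens kept
    private final Deque<String> history = new ArrayDeque<>(); // most recent first
    private String current;                 // token returned by the last next() call

    LookBackTokens(Iterator<String> input, int capacity) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be >= 0");
        }
        this.input = input;
        this.capacity = capacity;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext();
    }

    @Override
    public String next() {
        String token = input.next();
        if (current != null) {
            // The token we are leaving behind becomes the newest history entry.
            history.addFirst(current);
            if (history.size() > capacity) {
                history.removeLast();       // drop the oldest entry: memory stays bounded
            }
        }
        current = token;
        return token;
    }

    /** Token seen 'distance' steps before the current one (1 = immediately before), or null. */
    String previous(int distance) {
        if (distance < 1 || distance > history.size()) {
            return null;
        }
        Iterator<String> it = history.iterator();
        String result = null;
        for (int i = 0; i < distance; i++) {
            result = it.next();
        }
        return result;
    }

    /** True if the given token is among the remembered previous tokens. */
    boolean seenRecently(String token) {
        return history.contains(token);
    }

    public static void main(String[] args) {
        Iterator<String> words =
            Arrays.asList("we", "had", "up", "the", "dog", "given", "as", "lost").iterator();
        LookBackTokens tokens = new LookBackTokens(words, 4);
        while (tokens.hasNext()) {
            String token = tokens.next();
            // On reaching the main verb, look back for its particle.
            if (token.equals("given") && tokens.seenRecently("up")) {
                System.out.println("complex verb: " + token + " up");
            }
        }
    }
}
```

Memory use is bounded by the chosen capacity, and each call to next() does a
constant amount of extra work; the look-back itself is linear in the window
size, which should stay small for this purpose.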

Your help will be greatly appreciated!

Thanks,
-Patrick
