Hi, I am trying to get the Persian part of Lucene to work, but apparently the current implementation is just a simple stop-word tokenizer with no stemmer, etc. I tried to find contact details for the person who wrote it but couldn't find them anywhere in the code.
In addition, I went through the source and even drew some class diagrams from the code just to understand the project better. In fact, I was looking for a TokenStream that can give me the previous tokens in a stream, but apparently all the existing classes can only traverse forward and not backward.

The problem I am facing with a Persian stemmer is that verbs in Persian can be made of multiple words. A simple example in English would be the verb "give up", which has a completely different meaning from "give" or "up": We had *give*n the dog *up* as lost. A proper search query should understand this and give us the right search results.

In English it is easier to find such verbs because the main verb (give) comes first and the second word (up) comes after it. In Persian it is usually the other way around. Something like: We had *up* the dog *given* as lost. When you reach the token "given", you need to know whether this verb is a plain verb or a complex verb. Therefore, you have to find the token "up" earlier in the stream in order to build the correct verb.

So, please correct me if I am wrong. Given this requirement, my understanding is that we need a new TokenStream that holds a few of the previous tokens in a list or an array. If this is correct, please let me know how I can write such a class, and what considerations I should keep in mind: memory consumption, performance, staying consistent with the existing architecture, etc.

Your help will be greatly appreciated!

Thanks,
-Patrick
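
P.S. To make the question concrete, here is a rough sketch in plain Java of the kind of look-back buffer I have in mind. It does not use Lucene's classes at all: `LookbackTokens`, `CompoundVerbDemo`, the `compounds` map and the window size are all hypothetical names made up for illustration, and the English tokens stand in for Persian ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Lucene class): wraps a forward-only token
 * iterator and remembers the last `capacity` tokens already consumed.
 */
final class LookbackTokens {
    private final Iterator<String> input;
    private final Deque<String> history = new ArrayDeque<>();
    private final int capacity;
    private String current;

    LookbackTokens(Iterator<String> input, int capacity) {
        this.input = input;
        this.capacity = capacity;
    }

    /** Moves to the next token; returns false once the input is exhausted. */
    boolean advance() {
        if (current != null) {
            // The token we are leaving becomes part of the bounded history.
            history.addLast(current);
            if (history.size() > capacity) {
                history.removeFirst();
            }
        }
        if (!input.hasNext()) {
            current = null;
            return false;
        }
        current = input.next();
        return true;
    }

    String current() {
        return current;
    }

    /** Previously seen tokens, oldest first, at most `capacity` of them. */
    List<String> previous() {
        return new ArrayList<>(history);
    }
}

/** Toy use of the buffer: resolve a verb using a particle seen earlier. */
final class CompoundVerbDemo {
    /**
     * `compounds` is an assumed lookup table keyed by "particle verb",
     * e.g. "up given" -> "give up". A real stemmer would need a proper lexicon.
     */
    static List<String> resolve(List<String> tokens,
                                Map<String, String> compounds,
                                int window) {
        LookbackTokens stream = new LookbackTokens(tokens.iterator(), window);
        List<String> out = new ArrayList<>();
        while (stream.advance()) {
            String token = stream.current();
            String resolved = token;
            for (String earlier : stream.previous()) {
                String hit = compounds.get(earlier + " " + token);
                if (hit != null) {
                    resolved = hit;
                    break;
                }
            }
            out.add(resolved);
        }
        return out;
    }
}
```

With the tokens "we had up the dog given as lost", the single entry "up given" -> "give up" and a window of 4, the token "given" is replaced by "give up"; with a window of 2 the particle has already dropped out of the history and "given" is left as is. Memory is bounded by the window size, which seems acceptable. What this sketch does not solve is that "up" has already been emitted by the time "given" arrives, so a real implementation would probably have to hold tokens back before passing them on. Advice on how to do that within Lucene's design is exactly what I am after.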