Sounds like the most difficult part will be the question parser using POS (part-of-speech) tagging.

This is kind of old school, but you could use something like the AliceBot
AIML library (http://en.wikipedia.org/wiki/AIML), where the subject terms
can be extracted from the questions and indexed separately.

Or, as Grant and others suggest, use OpenNLP (which rocks) or LingPipe (the
LingPipe license is a bit of a pain) for entity extraction.

An interesting way to look at the data would be to construct 3 fields:
Original_Question, Question_base, and Subject.

Doc:
  Original_Question: Who is the president of the UN
  Question_base: Who is the president of
  Question_base: Who is
  Subject: the president of the UN
  Subject: the president
  Subject: the UN
/Doc

Similarity can then be somewhat easier to calculate between documents with
similar question bases, subjects, etc.

P

On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Hi Seid,
>
> Do you have a reference for the article? I've done some QA in my day, but
> don't recall reading that one.
>
> At any rate, I do think it is possible to do what you are after. See
> below.
>
> On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
>
>> For my work, I have read an article stating that "Answer type can be
>> automatically constructed by Indexing Different Questions and Answer
>> types. Later, when an unseen question appears, answer type for this
>> question will be found with the help of 'similarity function'
>> computation"
>>
>> so I am clear with the argument above. my problem is,
>> 1. how can I index individual questions and Answer types as is (not
>> tokenized)
>
> I'm not sure you want this, but when constructing your Field, just use the
> NOT_ANALYZED option.
>
>> 2. how can I calculate the similarity between indexed questions and
>> unseen questions (question of any type that can be asked later)
>
> In line with #1, I think you might be better off to actually tokenize the
> question as one field, and the answer type as a second field. Then, you
> can let Lucene calculate similarity via its normal query mechanisms. In
> this case, I would likely try experimenting with things like: exact match,
> phrase queries with slop, etc. That way, not only can you match "Who is the
> president of UN" but you might also match on things that are a bit fuzzier.
> To do this, you might need to have several fields per document with
> variations. I could also see using Lucene's payload mechanism as well.
>
> But, as Vasu said, you will likely need other parts too, like OpenNLP.
>
> HTH,
> Grant
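
PS: To make the three-field idea concrete, here is a rough sketch in plain
Python (not Lucene code). It assumes a very naive split at the first copula
and the first preposition; the word lists and the names build_doc and
field_overlap are made up for illustration, and a real parser would rely on
POS tags or entity extraction instead. The overlap score is a simple Jaccard
measure standing in for whatever similarity the index would actually compute.

```python
# Hypothetical split points; a real system would use POS tags instead.
COPULAS = {"is", "are", "was", "were"}
PREPOSITIONS = {"of", "in", "for", "at", "on"}


def build_doc(question):
    """Split a question into Original_Question, Question_base and Subject."""
    tokens = question.rstrip("?").split()
    bases, subjects = [], []
    # Position of the first copula; everything up to it is the shortest base.
    cop = next((i for i, t in enumerate(tokens) if t.lower() in COPULAS), None)
    if cop is not None:
        head, rest = tokens[:cop + 1], tokens[cop + 1:]
        bases.append(" ".join(head))
        subjects.append(" ".join(rest))
        # Split the subject once more at its first preposition, if any.
        prep = next(
            (i for i, t in enumerate(rest) if t.lower() in PREPOSITIONS), None
        )
        if prep is not None and 0 < prep < len(rest) - 1:
            bases.insert(0, " ".join(head + rest[:prep + 1]))
            subjects.append(" ".join(rest[:prep]))
            subjects.append(" ".join(rest[prep + 1:]))
    return {
        "Original_Question": [question],
        "Question_base": bases,
        "Subject": subjects,
    }


def field_overlap(doc_a, doc_b, name):
    """Jaccard overlap between the values of one multi-valued field."""
    a = {v.lower() for v in doc_a[name]}
    b = {v.lower() for v in doc_b[name]}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    un = build_doc("Who is the president of the UN")
    fr = build_doc("Who is the president of France")
    for name in ("Question_base", "Subject"):
        print(name, un[name], field_overlap(un, fr, name))
```

Two questions with the same question base but different subjects score high
on Question_base and low on Subject, which is the kind of signal that could
help when guessing the answer type of an unseen question.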