There are other (more trivial) problems as well. One geek from UFAL (our NLP lab) reported that it was hard to find the boundaries, or rather, to decide whether a dot ends a sentence or is something else, e.g. "blah, i.e. blah", "i.b.m.", "i.p. pavlov", "3.14", "28.10.2003", etc.
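To make the dot problem concrete, here is a minimal sketch of a rule-based check. It splits on whitespace, so dots inside a token ("3.14", "28.10.2003") are never considered, and only a trailing dot is tested. The names `ABBREVIATIONS`, `INITIALS`, `ends_sentence`, and `classify` are made up for this illustration, and the abbreviation list is a tiny assumed sample, not anything UFAL or Lucene uses.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"i.e.", "e.g.", "etc."}

# Runs of single letters each followed by a dot: "i.b.m.", "i.p.", "I.P."
INITIALS = re.compile(r"^([A-Za-z]\.)+$")


def ends_sentence(token: str) -> bool:
    """Guess whether the trailing dot of a whitespace-separated token ends a sentence."""
    if not token.endswith("."):
        # Covers "3.14" and "28.10.2003": the dots are internal to the token.
        return False
    if token.lower() in ABBREVIATIONS:
        return False
    if INITIALS.match(token):
        return False
    return True


def classify(text: str):
    """Return (token, is_sentence_end) pairs for each whitespace-separated token."""
    return [(tok, ends_sentence(tok)) for tok in text.split()]


if __name__ == "__main__":
    for tok, end in classify("It cost 3.14 on 28.10.2003. I.P. Pavlov agreed, i.e. he signed."):
        print(f"{tok!r:15} sentence end: {end}")
```

The obvious weakness is that an abbreviation such as "etc." can also close a sentence, which this sketch always misses; that is the sort of case that makes the problem less trivial than it looks.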
On the other hand, I would like to know which model Lucene implements. If it is not a vector model, what is it? ;-)
I would call it a vector space model.
The best description of how Lucene scores is:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
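For readers unfamiliar with the term, a rough sketch of what a vector space model means in general: documents and the query become term-weight vectors, and documents are ranked by the cosine of the angle to the query. This is the textbook tf-idf formulation, not Lucene's actual scoring formula (see the Similarity page above for that); the function names `cosine` and `rank` and the idf variant `log(1 + N/df)` are choices made for this illustration.

```python
import math
from collections import Counter


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Rank tokenized docs against a tokenized query; returns (score, doc_index) pairs."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + n / df[t]) for t in df}

    def vec(tokens):
        # tf * idf weights; terms unseen in the collection are dropped.
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    q = vec(query)
    return sorted(((cosine(q, vec(d)), i) for i, d in enumerate(docs)), reverse=True)


if __name__ == "__main__":
    docs = [
        ["lucene", "search", "engine"],
        ["vector", "space", "model"],
        ["lucene", "vector", "model"],
    ]
    for score, i in rank(["vector", "model"], docs):
        print(f"doc {i}: {score:.3f}")
```

In this toy collection the third document ranks first because it contains both query terms and no rare extra term that would lengthen its vector.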
Doug
