By the way, I was wondering whether any Analyzer uses the following constructor:

    public Token(String text, int start, int end, String typ)

Maybe it would be interesting to build an analyzer that recognizes punctuation marks and keeps them in the index as Tokens with a given type (say, "punctuation")? The advantage is that SloppyPhraseScorer.phraseFreq() could use this information to reject phrase matches that span a punctuation mark.

PhraseQueries are used for compound words (e.g. "personal computer") with a given slop value (say 3), so it would be good not to match things such as "It is not personal. My computer hates me...". One solution would be to set the slop to zero, but that is not possible in my case: I use a module that generates compound terms with slop values in order to handle morphological variations (e.g. in French, "gestion de la casse" and "gestion des casses" are represented by "gestion casse"^3 and "gestion casses"^3).

This would involve creating a subclass of PhraseQuery, or modifying it by adding a boolean flag, and changing the phraseFreq() method so that it checks that no Token with a punctuation type falls within the scope of the slop.

What do you think? Has anyone already tried something in this direction? Does it imply heavy changes?

Hugo: maybe you could store your stopwords as tokens with a different type?

----- Original Message -----
From: "hugo burm" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not indexed?

>
> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (lucene demo, my own 120000 xml documents, Cocoon search) and in all cases
> it looks like when you are looking for the phrase "a specification" it
> also finds documents which contain "the specification" (or "D. Washington"
> instead of "G. Washington").
>
> Of course you can change the indexing behaviour and make sure there are no
> stopwords and that all one-letter words and numbers are indexed. But that
> seems a bad approach. A better approach:
>
>   1) find all indexed words in the phrase, and from these words find all
>      documents containing them;
>   2) check the occurrence of the phrase by opening the original document.
>
> I am wondering: does Lucene perform step 2)? Of course, this step burns
> some CPU cycles.
>
> Hugo
> [EMAIL PROTECTED]
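
A minimal sketch of the punctuation check proposed in the reply at the top of this message, in plain Java with no Lucene dependency. Everything in it is an assumption made for illustration: TypedToken is a hypothetical stand-in for a Lucene Token that carries a type, and tokenize() and matches() are hypothetical helpers, not Lucene APIs. It only handles two-term, in-order phrases, and punctuation tokens occupy positions of their own, so it shows the idea rather than how SloppyPhraseScorer.phraseFreq() actually works.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a typed token; this is NOT Lucene's Token class. */
final class TypedToken {
    final String text;
    final String type; // either "word" or "punctuation"

    TypedToken(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

/** Illustration only: a two-term sloppy phrase match that refuses to cross punctuation. */
final class PunctuationAwarePhrase {
    static final String WORD = "word";
    static final String PUNCTUATION = "punctuation";

    /** Emits lowercase word tokens and one punctuation token per non-space symbol. */
    static List<TypedToken> tokenize(String text) {
        List<TypedToken> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            // Any other character ends the current word, if there is one.
            if (word.length() > 0) {
                tokens.add(new TypedToken(word.toString(), WORD));
                word.setLength(0);
            }
            // Whitespace is dropped; every other symbol is kept as punctuation.
            if (!Character.isWhitespace(c)) {
                tokens.add(new TypedToken(String.valueOf(c), PUNCTUATION));
            }
        }
        if (word.length() > 0) {
            tokens.add(new TypedToken(word.toString(), WORD));
        }
        return tokens;
    }

    /**
     * Returns true if 'first' is followed by 'second' with at most 'slop'
     * tokens between them and no punctuation token between them.
     */
    static boolean matches(List<TypedToken> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            TypedToken start = tokens.get(i);
            if (!WORD.equals(start.type) || !start.text.equals(first)) {
                continue;
            }
            int limit = Math.min(tokens.size() - 1, i + 1 + slop);
            for (int j = i + 1; j <= limit; j++) {
                TypedToken candidate = tokens.get(j);
                if (PUNCTUATION.equals(candidate.type)) {
                    break; // a punctuation mark inside the slop window rejects this start
                }
                if (candidate.text.equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With this sketch, matches(tokenize("It is not personal. My computer hates me..."), "personal", "computer", 3) is rejected because the full stop falls inside the slop window, while matches(tokenize("gestion de la casse"), "gestion", "casse", 3) still succeeds. Whether punctuation tokens should consume a position in a real index is a separate design choice that the sketch does not settle.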