Hello Doug,

that sounds interesting to me. I am referring to a NIST paper on
relevance feedback that ran tests with 20 - 200 words. This is why I
thought it might be worthwhile to use all non-stopwords of a document for this
and see what happens. Do you know of good papers on strategies for
selecting keywords effectively, beyond stopword lists and stemming?
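
Just to make concrete what I mean by using the selected words: something
along these lines is what I would try, with each term added as an optional
clause rather than a required one, as you suggest below. This is only a
sketch against the Lucene 1.3 query API as I understand it; the class name,
the "contents" field and the keywords array are placeholders of mine.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class KeywordQueryBuilder {
    // Turn the selected keywords into one OR-style query; no clause is
    // required, so documents matching only some of the terms still score.
    public static Query buildKeywordQuery(String[] keywords) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < keywords.length; i++) {
            Term term = new Term("contents", keywords[i]); // "contents" is a placeholder field name
            query.add(new TermQuery(term), false, false);  // not required, not prohibited
        }
        return query;
    }
}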

Using the term frequencies of the document is not really possible, since
Lucene does not provide access to a document's term vector, is it?
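
If there is no better way, I suppose I could re-run the raw text through the
same Analyzer used for indexing and count the terms myself, roughly like
this (again only a sketch, assuming the original text of the document is
still available outside the index; the class and method names are mine):

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TermFrequencyCounter {
    // Re-analyze the raw document text and count the terms by hand,
    // since the index itself does not hand back a per-document term vector.
    public static Map countTerms(Analyzer analyzer, String field, String text)
            throws IOException {
        Map frequencies = new HashMap();
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            String termText = token.termText();
            Integer old = (Integer) frequencies.get(termText);
            frequencies.put(termText, new Integer(old == null ? 1 : old.intValue() + 1));
        }
        stream.close();
        return frequencies;
    }
}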

By the way, could you send me Dmitry's code for the vector extension?
I asked for it in another thread but have not received it so far, and I
would really like to have a look. It would also be nice to know the status
of integrating it into Lucene 1.3. Who is working on it, and how could I
contribute?

Cheers,
Karl


> Andrzej Bialecki wrote:
> > Karl Koch wrote:
> >> I actually wanted to add a large amount of text from an existing
> >> document to find a closely related one. Can you suggest another good
> >> way of doing this?
> >
> > You should try to reduce the dimensionality by reducing the number of 
> > unique features. In this case, you could for example use only keywords 
> > (or key phrases) instead of the full content of documents.
> 
> Indeed, this is a good approach.  In my experience, six or eight terms 
> are usually enough, and they needn't all be required.
> 
> Doug
> 
