A crude way of finding 'like-documents' would be to take the document at hand and submit it as a query. Of course, before submitting it, you will need to "analyze" it so that the query does not become huge (and slow).
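One way that analysis step might look: score the document's terms by tf-idf against the collection, keep only a middle band of the ranking, and normalize the survivors into boost weights for the query. This is a minimal sketch in plain Python, not Lucene code; the tf-idf formula and the band thresholds (25%-75%) are my own assumptions.

```python
import math
from collections import Counter

def middle_band_boosts(doc_tokens, doc_freq, num_docs, low=0.25, high=0.75):
    """Score each of the document's terms by tf-idf, drop the top and
    bottom bands of the ranking, and turn the middle band into
    normalized query boost weights."""
    tf = Counter(doc_tokens)
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    # Sort ascending: stop-word-like terms land first,
    # highly document-specific terms last.
    ranked = sorted(scores, key=scores.get)
    band = ranked[int(len(ranked) * low):int(len(ranked) * high)]
    if not band:
        return {}
    top = max(scores[t] for t in band)
    # Scale the surviving scores relative to the largest one in the band.
    return {t: scores[t] / top for t in band}
```

The resulting dict maps each retained term to a boost, which could then be fed into a boosted OR-query against the index.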
One way of doing it would be to transform the vector of words and corresponding frequencies into boosting weights. Before that, though, you will need to remove anything that looks like junk or that is a stop word (and this could be done by one of the analyzers). But you need to go one step further -- you need to retain only the best words.

From what I have seen, the best words for classification or clustering lie somewhere in the middle of a tf-idf ranking of the document's terms. The terms in the higher band would be very specific to the document at hand, and those in the lower band would most likely be stop words. The terms in the middle, instead, are what unifies documents with each other.

My interest being in dynamic categorization and clustering, I experimented with this approach and achieved 80% accuracy in classifying the Reuters-21578 corpus. It could have been even higher if I had counted as "correct class" the documents that, for example, should have gone under dollars but went instead under foreign exchange.

Keep in mind that this classification is dynamic, because the tf-idf world changes with the addition of new documents. For example, if at a given moment you get a document containing the first references to XYZ, the term XYZ will get the highest tf-idf value. But if every following document contains XYZ, then XYZ's tf-idf will drop towards stop-word values.

Has anyone else experimented with this? Maybe Dmitry can tell us more about the quality of results with his approach.

-- Alex

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 29, 2001 9:56 PM
To: [EMAIL PROTECTED]
Subject: TermVector support

I noticed a post discussing term vectors on the lucene-dev list. It is great that people are working on adding term vectors to Lucene. Stored term vectors per field (or document) are a great way to get Lucene into classification of documents.
There are a great many text classification algorithms that take primarily two inputs: a term vector for the whole collection of documents and a term vector for each document. A nice feature to add using the vectors would be 'like-documents': given an indexed document, what other documents are there in the index that are like it? Another feature could be clustering of documents into categories based on the vectors.

I am curious when the term vector support that you worked on will be added to the main build branch. I couldn't find any information on that.

-Emile

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
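The 'like-documents' feature Emile describes is usually built on similarity between the per-document term vectors, most commonly cosine similarity. A minimal sketch in plain Python (not Lucene's API; the vectors are assumed to be term-to-weight dicts, e.g. tf-idf weights):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def like_documents(query_vec, index, top_n=5):
    """Rank indexed documents by similarity to the given document's vector.

    `index` is assumed to map doc_id -> term vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

With per-document term vectors stored in the index, this avoids re-analyzing document text at query time; the clustering Emile mentions can reuse the same similarity function as its distance measure.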
