A crude way of finding 'like-documents' would be to take the document at hand and submit it as a query. Of course, before submitting it, you will need to "analyze" it so that the query does not become huge (and slow).
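One way that analysis step might look: score the document's terms by tf-idf against the collection, keep only a middle band of the ranking, and normalize the survivors into boost weights for the query. This is a minimal sketch in plain Python, not Lucene code; the tf-idf formula and the band thresholds (25%-75%) are my own assumptions.

```python
import math
from collections import Counter

def middle_band_boosts(doc_tokens, doc_freq, num_docs, low=0.25, high=0.75):
    """Score each of the document's terms by tf-idf, drop the top and
    bottom bands of the ranking, and turn the middle band into
    normalized query boost weights."""
    tf = Counter(doc_tokens)
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    # Sort ascending: stop-word-like terms land first,
    # highly document-specific terms last.
    ranked = sorted(scores, key=scores.get)
    band = ranked[int(len(ranked) * low):int(len(ranked) * high)]
    if not band:
        return {}
    top = max(scores[t] for t in band)
    # Scale the surviving scores relative to the largest one in the band.
    return {t: scores[t] / top for t in band}
```

The resulting dict maps each retained term to a boost, which could then be fed into a boosted OR-query against the index.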
One way of doing it would be to transform the vector of words and corresponding frequencies into boosting weights. Before that, though, you will need to remove anything that looks like junk or that is a stop word (and this could be done by one of the analyzers). But you need to go one step further -- you need to retain only the best words.

From what I have seen, the best words for classification or clustering lie somewhere in the middle of a tf-idf ranking of the document's terms. The terms in the higher band would be very specific to the document at hand, and those in the lower band would most likely be stop words. The terms in the middle, instead, are what unifies documents with each other.

My interest being in dynamic categorization and clustering, I experimented with this approach and achieved 80% accuracy in classifying the Reuters-21578 corpus. It could have been even higher if I had counted as "correct class" the documents that, for example, should have gone under dollars but went instead under foreign exchange.

Keep in mind that this classification is dynamic, because the tf-idf world changes with the addition of new documents. For example, if at a given moment you get a document containing the first references to XYZ, the term XYZ will get the highest tf-idf value. But if every following document contains XYZ, then XYZ's tf-idf will drop towards stop-word values.

Has anyone else experimented with this? Maybe Dmitry can tell us more about the quality of results with his approach.

-- Alex

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 29, 2001 9:56 PM
To: [EMAIL PROTECTED]
Subject: TermVector support

I noticed a post discussing term vectors on the lucene-dev list. It is great that people are working on adding term vectors to Lucene. Stored term vectors per field (or document) are a great way to get Lucene into classification of documents.
There are a great many text classification algorithms that take primarily two inputs: a term vector for the whole collection of documents and a term vector for each document. A nice feature to add using the vectors would be 'like-documents': given an indexed document, what other documents are there in the index that are like it? Another feature could be clustering of documents into categories based on the vectors.

I am curious when the term vector support that you worked on will be added to the main build branch. I couldn't find any information on that.

-Emile

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
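The 'like-documents' feature Emile describes is usually built on similarity between the per-document term vectors, most commonly cosine similarity. A minimal sketch in plain Python (not Lucene's API; the vectors are assumed to be term-to-weight dicts, e.g. tf-idf weights):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def like_documents(query_vec, index, top_n=5):
    """Rank indexed documents by similarity to the given document's vector.

    `index` is assumed to map doc_id -> term vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

With per-document term vectors stored in the index, this avoids re-analyzing document text at query time; the clustering Emile mentions can reuse the same similarity function as its distance measure.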
