On Jun 23, 2004, at 7:57 PM, Stefan Groschupf wrote:
Is there a best practice to get the term vector of an document?

You get term vectors by "field", not document. As for best practice, there is really only one way to go about getting the term vectors:


        TermFreqVector termFreqVector =
            reader.getTermFreqVector(i, "subject");

where reader is an IndexReader and i is the document number.

Is there any experience to do any kind of feature selection for dimension reducing like zipf laws or getting tf/idf of a term for the complete corpora.

Document frequency of a term is in the term info file (.tis) and available through TermEnum (see IndexReader.terms()).


        Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to