Retrieving the term vectors of a document in Nutch

House Less Mon, 08 Jun 2009 15:46:24 -0700

Hello everyone,

I am quite new to development with Nutch, so you must forgive my question if it 
is amateurish. I asked it at the Lucene Java user mailing list and Grant 
Ingersoll referred me to this list.


After
some reading of Luke's source code, I found to my dismay that obtaining
the TermFreqVector of a document via the IndexReader resulted in no
vectors at all. A mailing list entry found via Google said that Nutch does not 
store the contents of a page in its Lucene indices. This makes sense.

I
then read the Nutch source code and figured out that one could use
NutchBean to reconstruct the parsed text of an indexed page.

However,
this still left the nagging problem of retrieving the TermFreqVector
for the parsed text of a page. I tried MoreLikeThis to retrieve the set
of terms but that did not work either; it was simply empty. The source
code to MoreLikeThis suggests certain assumptions made on the Lucene
indices being accessed.

At the end of the day, I simply decided to reconstruct the term frequency 
vector of a page by referring to TermDocs in the IndexReader. This is
not very efficient since I have to do this for every page iterated over
the Lucene document index.

I wonder whether it is possible to
retrieve previously computed TermFreqVector[] of a document in Nutch's
Lucene indices? Surely the term frequency vectors must be somewhere because 
Nutch makes use of TF-IDF to compute the score of a page for a given query. 
Your insights on the matter will
help.

House

Retrieving the term vectors of a document in Nutch

Reply via email to