Retrieving the term vectors of a document in Nutch

House Less Sun, 07 Jun 2009 17:14:52 -0700

Hello everyone,

I am quite new to development with Nutch, so you must forgive my question if it 
is amateurish.


After some reading of Luke's source code, I found to my dismay that obtaining 
the TermFreqVector of a document via the IndexReader resulted in no vectors at 
all. A mailing list entry found via Google said that Nutch does not store the 
contents of a page in its Lucene indices. This makes sense.

I then read the Nutch source code and figured out that one could use NutchBean 
to reconstruct the parsed text of an indexed page.

However, this still left the nagging problem of retrieving the TermFreqVector 
for the parsed text of a page. I tried MoreLikeThis to retrieve the set of 
terms but that did not work either; it was simply empty. The source code to 
MoreLikeThis suggests certain assumptions made on the Lucene indices being 
accessed.

At the end of the day, I simply decided to reconstruct the term frequency 
vector of a page by referring to TermDocs in the IndexReader. This is not very 
efficient since I have to do this for every page iterated over the Lucene 
document index.

I wonder whether it is possible to retrieve previously computed 
TermFreqVector[] of a document in Nutch's Lucene indices? Could it be that this 
is not possible because Nutch does not store the TermFreqVector[]? Your 
insights on the matter will help.

House



      


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Retrieving the term vectors of a document in Nutch

Reply via email to