House Less wrote:
> 
> 
> Hello everyone,
> 
> I am quite new to development with Nutch, so you must forgive my question
> if it is amateurish. I asked it at the Lucene Java user mailing list and
> Grant Ingersoll referred me to this list.
> 
> After some reading of Luke's source code, I found to my dismay that
> obtaining the TermFreqVector of a document via the IndexReader resulted
> in no vectors at all. A mailing list entry found via Google said that
> Nutch does not store the contents of a page in its Lucene indices. This
> makes sense.
> 
> I then read the Nutch source code and figured out that one could use
> NutchBean to reconstruct the parsed text of an indexed page.
> 
> However, this still left the nagging problem of retrieving the
> TermFreqVector for the parsed text of a page. I tried MoreLikeThis to
> retrieve the set of terms but that did not work either; the result was
> simply empty. The source code of MoreLikeThis suggests that it makes
> certain assumptions about the Lucene indices being accessed.
> 
> At the end of the day, I simply decided to reconstruct the term frequency
> vector of a page by referring to TermDocs in the IndexReader. This is
> not very efficient since I have to do this for every page iterated over
> the Lucene document index.
> 
> I wonder whether it is possible to retrieve the previously computed
> TermFreqVector[] of a document in Nutch's Lucene indices? Surely the
> term frequency vectors must be somewhere, because Nutch makes use of
> TF-IDF to compute the score of a page for a given query. Your insights
> on the matter will help.
> 
> House
> 
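
The TermDocs workaround described in the quoted message can be pictured as
inverting the postings once for all documents rather than once per page. The
sketch below is plain Java and does not touch Lucene at all: the nested map
stands in for what walking IndexReader.terms() and
IndexReader.termDocs(term) (reading doc() and freq()) would yield in the
Lucene versions of that era. The class name TermVectorRebuild and the method
invert are hypothetical, made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch or Lucene.
class TermVectorRebuild {

    /**
     * Input:  term -> (docId -> frequency), the shape an inverted index stores.
     * Output: docId -> (term -> frequency), one term frequency vector per document.
     * Every posting is visited exactly once, so all vectors come out of a single pass.
     */
    static Map<Integer, Map<String, Integer>> invert(
            Map<String, Map<Integer, Integer>> postings) {
        Map<Integer, Map<String, Integer>> vectors = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            String term = termEntry.getKey();
            for (Map.Entry<Integer, Integer> hit : termEntry.getValue().entrySet()) {
                // Create the document's vector on first sight, then record the term.
                vectors.computeIfAbsent(hit.getKey(), id -> new TreeMap<>())
                       .put(term, hit.getValue());
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Toy postings (assumed data): "nutch" occurs in docs 0 and 1, "lucene" in doc 0.
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.computeIfAbsent("nutch", t -> new HashMap<>()).put(0, 2);
        postings.computeIfAbsent("nutch", t -> new HashMap<>()).put(1, 1);
        postings.computeIfAbsent("lucene", t -> new HashMap<>()).put(0, 1);
        System.out.println(invert(postings));
    }
}
```

The trade-off in this sketch is memory: every vector is held at once, which
may not be acceptable for a large crawl.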


Hello everyone :)
I know this post is almost 10 months old, but I have the same problem :)
and didn't find any other answers. Is there any news on why
TermFreqVector is empty (null), or on another easy and efficient way to get
this data? What matters most to me is getting the values of specific
indexes for each document/page I'm currently processing. I'm writing a
search engine and want to know how many of the result pages (results for a
search query that includes index_1) have index_2 equal to VAL_1, how many
have VAL2, and so on. I thought I could do it with TermFreqVector.
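
To make the counting described above concrete, here is a minimal sketch in
plain Java. It assumes the value of index_2 can already be looked up for each
result document by some means (the map below is stand-in data, not a Nutch or
Lucene call). The names FieldValueCounter and countValues, the doc ids, and
the value VAL_2 are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch or Lucene.
class FieldValueCounter {

    /**
     * Counts how often each field value occurs among the result documents.
     * resultDocIds: ids of the pages returned for the query.
     * valueByDoc:   docId -> value of the field of interest (index_2 in the post).
     */
    static Map<String, Integer> countValues(List<Integer> resultDocIds,
                                            Map<Integer, String> valueByDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : resultDocIds) {
            String value = valueByDoc.get(docId);
            if (value != null) {
                // Add 1 to the running count for this value.
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed data: four indexed pages, three of which match the query.
        Map<Integer, String> valueByDoc = new HashMap<>();
        valueByDoc.put(0, "VAL_1");
        valueByDoc.put(1, "VAL_2");
        valueByDoc.put(2, "VAL_1");
        valueByDoc.put(3, "VAL_2");
        List<Integer> results = Arrays.asList(0, 1, 2);
        System.out.println(countValues(results, valueByDoc));
    }
}
```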


Regards
Jacob
-- 
View this message in context: 
http://n3.nabble.com/Retrieving-the-term-vectors-of-a-document-in-Nutch-tp618341p739372.html
Sent from the Nutch - User mailing list archive at Nabble.com.
