Re: Using a TermFreqVector to get counts of all words in a document

Grant Ingersoll Thu, 21 Oct 2010 03:51:43 -0700

On Oct 20, 2010, at 4:40 PM, Martin O'Shea wrote:

> http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c128
> 7065863.4cb7110774...@netmail.pipex.net%3e will give you a better idea of
> what I'm moving towards.
> 
> It's all a bit grey at the moment so further investigation is inevitable.
> 
> I expect that a combination of MySQL database storage and Lucene indexing is
> going to be the end result.


I'd likely take the TermVectorMapper approach, but otherwise, yeah, I think you 
are on the right track.


> 
> 
> 
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsing...@apache.org] 
> Sent: 20 Oct 2010 21 20
> To: java-user@lucene.apache.org
> Subject: Re: Using a TermFreqVector to get counts of all words in a document
> 
> 
> On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote:
> 
>> Uwe
>> 
>> Thanks - I figured that bit out. I'm a Lucene 'newbie'.
>> 
>> What I would like to know though is if it is practical to search a single
>> document of one field simply by doing this:
>> 
>> IndexReader trd = IndexReader.open(index);
>>       TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
>>       String[] terms = tfv.getTerms();
>>       int[] freqs = tfv.getTermFrequencies();
>>       for (int i = 0; i < tfv.getTerms().length; i++) {
>>           System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
>>       }
>>       trd.close();
>> 
>> where docId is set to 0.
>> 
>> The code works but can this be improved upon at all?
>> 
>> My situation is where I don't want to calculate the number of documents
> with
>> a particular string. Rather I want to get counts of individual words in a
>> field in a document. So I can concatenate the strings before passing it to
>> Lucene.
> 
> Can you describe the bigger problem you are trying to solve?  This looks
> like a classic XY problem: http://people.apache.org/~hossman/#xyproblem
> 
> What you are doing above will work OK for what you describe (up to the
> "passing it to Lucene" part), but you probably should explore the use of the
> TermVectorMapper which provides a callback mechanism (similar to a SAX
> parser) that will allow you to build your data structures on the fly instead
> of having to serialize them into two parallel arrays and then loop over
> those arrays to create some other structure.
> 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Using a TermFreqVector to get counts of all words in a document

Reply via email to